[Bug c/40363] New: Nonoptimal save/restore registers
IMHO, the current save/restore registers strategy is not optimal. Look:

# cat test.c
#include <stdio.h>
void print(char *mess, char *format, int text)
{
    printf(mess);
    printf(format, text);
}
void main()
{
    print("X=", "%d\n", 1);
}
# gcc --version
gcc (GCC) 4.5.0 20090601 (experimental)
# gcc -o test test.c -O2
# objdump -d test

004004d0 <print>:
  4004d0:  48 89 5c 24 f0   mov    %rbx,-0x10(%rsp)
  4004d5:  48 89 6c 24 f8   mov    %rbp,-0x8(%rsp)
  4004da:  48 89 f3         mov    %rsi,%rbx
  4004dd:  48 83 ec 18      sub    $0x18,%rsp
  4004e1:  89 d5            mov    %edx,%ebp
  4004e3:  31 c0            xor    %eax,%eax
  4004e5:  e8 ce fe ff ff   callq  4003b8 <printf@plt>
  4004ea:  89 ee            mov    %ebp,%esi
  4004ec:  48 89 df         mov    %rbx,%rdi
  4004ef:  48 8b 6c 24 10   mov    0x10(%rsp),%rbp
  4004f4:  48 8b 5c 24 08   mov    0x8(%rsp),%rbx
  4004f9:  31 c0            xor    %eax,%eax
  4004fb:  48 83 c4 18      add    $0x18,%rsp
  4004ff:  e9 b4 fe ff ff   jmpq   4003b8 <printf@plt>

Let's replace the current save/restore:

  48 89 5c 24 f0   mov    %rbx,-0x10(%rsp)
  48 89 6c 24 f8   mov    %rbp,-0x8(%rsp)
  48 83 ec 18      sub    $0x18,%rsp
  ...
  48 8b 6c 24 10   mov    0x10(%rsp),%rbp
  48 8b 5c 24 08   mov    0x8(%rsp),%rbx
  48 83 c4 18      add    $0x18,%rsp

with the faster and shorter new save/restore:

  55               push   %rbp
  53               push   %rbx
  53               push   %rbx    ; dummy push
  ...
  5b               pop    %rbx    ; dummy pop
  5b               pop    %rbx
  5d               pop    %rbp

IMPORTANT note: for faster execution, the dummy push has to use the same register as the previous push!

Measurement results on Core2: the new save/restore is 5 ticks faster than the current one.

Regards, Vladimir Volynsky

--
           Summary: Nonoptimal save/restore registers
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40363
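A note on why the dummy push is needed at all: a minimal sketch, assuming the SysV AMD64 ABI's 16-byte stack alignment at each call site. The original prologue adjusts %rsp by 0x18 = 24 bytes; two real pushes account for only 16 of them, so the dummy push supplies the remaining 8 bytes:

print:
  push %rbp          ; 8 bytes
  push %rbx          ; 8 bytes (16 so far)
  push %rbx          ; dummy push: 24 bytes total, same as sub $0x18,
                     ; so %rsp keeps its 16-byte alignment at the callq
  ...
  pop  %rbx          ; dummy pop, value discarded
  pop  %rbx
  pop  %rbp
  jmpq printf@plt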
[Bug target/40171] GCC does not pass -mtune and -march options to assembler!
--- Comment #4 from vvv at ru dot ru 2009-05-25 19:54 ---
(In reply to comment #2)
> This is very odd? What is the assembler doing that the compiler isn't?

There exist some optimizations that are impossible without exact knowledge of addresses and opcodes. One example is avoiding branch mispredicts - http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
Another example: ensuring that instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line...
Unfortunately, the current version of GNU as can't do these optimizations.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40171
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #49 from vvv at ru dot ru 2009-05-20 21:38 ---
(In reply to comment #48)

How do these patches work? Are some special options required?

# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
  if (i == 1) F(1);
  if (i == 2) F(2);
  if (i == 3) F(3);
  if (i == 4) F(4);
  if (i == 5) F(5);
}
extern int F(int m);
void func(int x)
{
  int u = F(x);
  while (u)
    u = F(u)*3+1;
}
# /media/disk-1/B/bin/gcc -o t test.c -O2 -c -mtune=k8
# objdump -d t

0000000000000000 <f>:
   0:  83 ff 01               cmp    $0x1,%edi
   3:  74 1b                  je     20 <f+0x20>
   5:  83 ff 02               cmp    $0x2,%edi
   8:  74 16                  je     20 <f+0x20>
   a:  83 ff 03               cmp    $0x3,%edi
   d:  74 11                  je     20 <f+0x20>
   f:  83 ff 04               cmp    $0x4,%edi
  12:  74 0c                  je     20 <f+0x20>
  14:  83 ff 05               cmp    $0x5,%edi
  17:  74 07                  je     20 <f+0x20>
  19:  f3 c3                  repz retq
  1b:  0f 1f 44 00 00         nopl   0x0(%rax,%rax,1)
  20:  31 c0                  xor    %eax,%eax
  22:  e9 00 00 00 00         jmpq   27 <f+0x27>
  27:  66 0f 1f 84 00 00 00 00 00  nopw 0x0(%rax,%rax,1)

0000000000000030 <func>:
  30:  48 83 ec 08            sub    $0x8,%rsp
  34:  e8 00 00 00 00         callq  39 <func+0x9>
  39:  85 c0                  test   %eax,%eax
  3b:  89 c7                  mov    %eax,%edi
  3d:  74 0e                  je     4d <func+0x1d>
  3f:  90                     nop
  40:  e8 00 00 00 00         callq  45 <func+0x15>
  45:  8d 7c 40 01            lea    0x1(%rax,%rax,2),%edi
  49:  85 ff                  test   %edi,%edi
  4b:  75 f3                  jne    40 <func+0x10>
  4d:  48 83 c4 08            add    $0x8,%rsp
  51:  c3                     retq

I can't see any padding in function f :(

PS. In file config/i386/i386.c (ix86_avoid_jump_mispredicts):
  /* Look for all minimal intervals of instructions containing 4 jumps. ...
Not jumps, but _branches_ (CALL, JMP, conditional branches, or returns)?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug c/40171] New: GCC does not pass -mtune and -march options to assembler!
The GNU assembler supports optimization options, but GCC does not pass the -mtune and -march options to the assembler! For full optimization it's currently required to specify them twice:

# gcc ... -mtune=core2 -Wa,-mtune=core2

There is no default passing of optimization options from GCC to as, but many programmers assume it happens, because it's very strange to optimize code at the GCC level and then not optimize at the assembler level. Even the Linux kernel uses -march without -Wa,-march.

PS. CC vvv@ru.ru, please.

--
           Summary: GCC does not pass -mtune and -march options to
                    assembler!
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40171
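One way to see what the assembler actually receives (a sketch; the exact command line varies by build and version): run the driver with -v and inspect the as invocation it prints.

# gcc -v -O2 -mtune=core2 -c test.c
...
 as -V -Qy -o test.o /tmp/ccXXXXXX.s      <-- no -mtune/-march anywhere here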
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #30 from vvv at ru dot ru 2009-05-14 09:01 ---
Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.

Here are the results of my testing. Code:

align 128
test_cikl:
        rept 14          ; 14 if SH=0, 15 if SH=1, 16 if SH=2
        { nop }
        cmp  al,0        ; 2 bytes
        jz   $+10h+NOPS  ; 2 bytes  offset=0
        cmp  al,1        ; 2 bytes  offset=2
        jz   $+0Ch+NOPS  ; 2 bytes  offset=4
        cmp  al,2        ; 2 bytes  offset=6
        jz   $+08h+NOPS  ; 2 bytes  offset=8
        cmp  al,3        ; 2 bytes  offset=A
        match =1, NOPS { nop }
        match =2, NOPS { xchg eax,eax }  ; 2-byte NOP
        jz   $+04h       ; 2 bytes  offset=C
        ja   $+02h       ; 2 bytes  offset=E
        mov  eax,ecx
        and  eax,7h
        loop test_cikl

This code was tested on Core2, Xeon and P4 CPUs. Results in RDTSC ticks:

; Core 2 Duo
;      NOPS/tick/Max    NOPS/tick/Max    NOPS/tick/Max
; SH=0  0/571/729        1/306/594        2/315/630
; SH=1  0/338/612        1/338/648        2/339/648
; SH=2  0/339/666        1/339/675        2/333/693
; Xeon 3110
;      NOPS/tick/Max    NOPS/tick/Max    NOPS/tick/Max
; SH=0  0/586/693        1/310/675        2/310/675
; SH=1  0/333/657        1/330/648        2/464/630
; SH=2  0/333/657        1/470/594        2/474/603
; P4
;      NOPS/tick/Max    NOPS/tick/Max    NOPS/tick/Max
; SH=0  0/1027/1317      1/1094/1258      2/1028/1207
; SH=1  0/1151/1377      1/1068/1352      2/902/1275
; SH=2  0/1124/1275      1/1148/1335      2/979/1139

Conclusions:
1. Core2 and Xeon show similar results; the P4 shows something strange. For Core2/Xeon the padding is very effective: code with padding is almost 2 times faster. No sense for P4?
2. My previous statement

> 1. AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but
> Intel limitation is for a 16-byte chunk (memory range X - X+10h)

is wrong, at least for Core2/Xeon. For these CPUs a 16-byte chunk means the memory range XXX0 - XXXF. Unfortunately, I can't test AMD.

PS. My testing tool is in the attachment. It starts under MS-DOS, switches to 32-bit mode, then to 64-bit mode, and measures RDTSC ticks for the test code.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #34 from vvv at ru dot ru 2009-05-14 19:43 ---
(In reply to comment #32)
> Please make sure that you only test nop paddings for branch insns, not nop
> paddings for branch targets, which prefer 16byte alignment.

Additional test results (for Core2):
1. Execution time doesn't depend on padding of branch targets.
2. Execution time doesn't depend on the position of the NOP within a 16-byte chunk containing 4 branches, even if the NOP is inserted between CMP and the conditional jump.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #19 from vvv at ru dot ru 2009-05-13 11:42 ---
(In reply to comment #18)
> No, .p2align is the right thing to do, given that GCC doesn't have 100%
> accurate information about instruction sizes (for e.g. inline asms it can't
> have, for stuff where branch shortening can decrease the size it doesn't
> have it until the branch shortening phase, which is too late for this
> machine reorg, and in other cases the lengths are just upper bounds).
> Say .p2align 16,,5 says insert a nop up to 5 bytes if you can reach the
> 16-byte boundary with it, otherwise don't insert anything. But that
> necessarily means that there were less than 11 bytes in the same 16 byte
> page, and if the lower bound insn size estimation determined that in 11
> bytes you can't have 3 branch changing instructions, you are fine.
> Breaking of fused compare and jump (32-bit code only) is unfortunate, but
> inserting it before the cmp would mean often unnecessarily large padding.

You are right, if padding is required for every 16-byte page with 4 branches on it. But Intel writes about a 16-byte chunk, not a 16-byte page. Quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual:

  Assembly/Compiler Coding Rule 10. (M impact, L generality)
  Do not put more than four branches in a 16-byte chunk.

IMHO, here "chunk" means a memory range from x to x+10h, where x is _any_ address.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
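For reference, the gas semantics being discussed, as a small sketch (note that on x86 the first operand of .p2align is the log2 of the alignment, so a 16-byte boundary is written as 4):

        .p2align 4,,5   # pad to a 16-byte boundary, but only if at most
                        # 5 bytes of padding are needed; otherwise emit nothing
        .p2align 4      # always pad to a 16-byte boundary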
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 ---
I guess! Your patch is absolutely correct for AMD Athlon 64 and AMD Opteron processors, but it is nonoptimal for Intel processors. Because:
1. The AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but the Intel limitation is for a 16-byte chunk (memory range X - X+10h).
2. AMD: a maximum of _THREE_ near branches (CALL, JMP, conditional branches, or returns); Intel: a maximum of _FOUR_ branches!

Quotation from the Software Optimization Guide for AMD64 Processors:

  6.1 Density of Branches
  When possible, align branches such that they do not cross a 16-byte
  boundary. The AMD Athlon 64 and AMD Opteron processors have the capability
  to cache branch-prediction history for a maximum of three near branches
  (CALL, JMP, conditional branches, or returns) per 16-byte fetch window. A
  branch instruction that crosses a 16-byte boundary is counted in the second
  16-byte window. Due to architectural restrictions, a branch that is split
  across a 16-byte boundary cannot dispatch with any other instructions when
  it is predicted taken. Perform this alignment by rearranging code; it is
  not beneficial to align branches using padding sequences.
  The following branches are limited to three per 16-byte window:
    jcc rel8       jcc rel32
    jmp rel8       jmp rel32      jmp reg
    jmp WORD PTR   jmp DWORD PTR
    call rel16     call r/m16
    call rel32     call r/m32
  Coding more than three branches in the same 16-byte code window may lead to
  conflicts in the branch target buffer. To avoid conflicts in the branch
  target buffer, space out branches such that three or fewer exist in a given
  16-byte code window. For absolute optimal performance, try to limit
  branches to one per 16-byte code window. Avoid code sequences like the
  following:

    ALIGN 16
    label3:
        call label1   ; 1st branch in 16-byte code window
        jc   label3   ; 2nd branch in 16-byte code window
        call label2   ; 3rd branch in 16-byte code window
        jnz  label4   ; 4th branch in 16-byte code window
                      ; Cannot be predicted.

  If there is a jump table that contains many frequently executed branches,
  pad the table entries to 8 bytes each to assure that there are never more
  than three branches per 16-byte block of code.
  Only branches that have been taken at least once are entered into the
  dynamic branch prediction, and therefore only those branches count toward
  the three-branch limit.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #25 from vvv at ru dot ru 2009-05-13 18:56 ---
(In reply to comment #22)
> CCing H.J. for Intel optimization issues.

> > 1. AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but
> > Intel limitation is for a 16-byte chunk (memory range X - X+10h)

I have doubts about this now. Thanks to Richard Guenther (comment #20). So I am going to make measurements to check it for Core2.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #26 from vvv at ru dot ru 2009-05-13 19:05 ---
(In reply to comment #23)
> Note that we need something that works for the generic model as well, which
> in this case looks like it is the same as for AMD models.

There is a processor property TARGET_FOUR_JUMP_LIMIT; maybe create a new one, TARGET_FIVE_JUMP_LIMIT?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
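A sketch of what that could look like, mirroring the existing macro (the exact name, its placement in config/i386/i386.h, and which tunings would set the bit are my assumptions, not existing code):

/* i386.h: hypothetical companion to TARGET_FOUR_JUMP_LIMIT */
#define TARGET_FIVE_JUMP_LIMIT \
        ix86_tune_features[X86_TUNE_FIVE_JUMP_LIMIT]

/* Intent: AMD tunings keep the four-jump limit (pad when a 16-byte window
   would hold four branches, since AMD predicts at most three per window);
   Intel tunings would set only the five-jump limit (pad at five, since
   Intel predicts up to four). */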
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #28 from vvv at ru dot ru 2009-05-13 19:18 ---
(In reply to comment #24)
> Using padding to avoid 4 branches in 16byte chunk may not be a good idea
> since it will increase code size.

Only one single-byte NOP per 16-byte chunk is enough for the padding. And IMHO, four branches in a 16-byte chunk is a very, very infrequent case, especially in 64-bit mode.
BTW, it's difficult to understand what Intel means by the term "branch". Is it CALL, JMP, conditional branches, or returns (same as AMD), or only JMP and conditional branches? I believe the last case is right.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #17 from vvv at ru dot ru 2009-05-12 16:40 ---
(In reply to comment #16)
> Created an attachment (id=17783)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
> gcc45-pr39942.patch
>
> Patch that attempts to take into account .p2align directives that are
> emitted for (some) CODE_LABELs and also the gen_align insns that the pass
> itself inserts. For a CODE_LABEL, say .p2align 16,,10 means either that the
> .p2align directive starts a new 16 byte page (then insns before it are never
> interesting), or nothing was skipped because more than 10 bytes would need
> to be skipped. But that means the current group could contain only 5 or
> less bytes of instructions before the label, so again, we don't have to
> look at instructions not in the last 5 bytes.
> Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't
> emit the second .p2align 3, which might (and often does) skip more than
> MAX_SKIP bytes (up to 7).

Nice patch. Code looks better. I checked it on Linux kernel 2.6.29.2. But two notes:

1. There is no guarantee that .p2align will be translated to NOPs. Example:

# cat test.c
void f(int i)
{
  if (i == 1) F(1);
  if (i == 2) F(2);
  if (i == 3) F(3);
  if (i == 4) F(4);
  if (i == 5) F(5);
}
# gcc -o test.s test.c -O2 -S
# cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
.globl f
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
        cmpl    $1, %edi
        je      .L7
        cmpl    $2, %edi
        je      .L7
        cmpl    $3, %edi
        je      .L7
        cmpl    $4, %edi
        .p2align 4,,5        <--- attempt of padding
        je      .L7
        cmpl    $5, %edi
        je      .L7
        rep ret
        .p2align 4,,10
        .p2align 3
.L7:
        xorl    %eax, %eax
        jmp     F
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .ident  "GCC: (GNU) 4.5.0 20090512 (experimental)"
        .section        .note.GNU-stack,"",@progbits
# gcc -o test.out test.s -O2 -c
# objdump -d test.out

0000000000000000 <f>:
   0:  83 ff 01     cmp    $0x1,%edi
   3:  74 1b        je     20 <f+0x20>
   5:  83 ff 02     cmp    $0x2,%edi
   8:  74 16        je     20 <f+0x20>
   a:  83 ff 03     cmp    $0x3,%edi
   d:  74 11        je     20 <f+0x20>
   f:  83 ff 04     cmp    $0x4,%edi
  12:  74 0c        je     20 <f+0x20>    <--- no NOP here
  14:  83 ff 05     cmp    $0x5,%edi
  17:  74 07        je     20 <f+0x20>
  19:  f3 c3        repz retq

IMHO, it is better to insert not .p2align but NOPs directly. (I mean the line
  emit_insn_before (gen_align (GEN_INT (padsize)), insn);
)

2. IMHO, it's a bad idea to insert something between CMP and the conditional jump. Quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual:

  3.4.2.2 Optimizing for Macro-fusion
  Macro-fusion merges two instructions to a single μop. Intel Core
  Microarchitecture performs this hardware optimization under limited
  circumstances. The first instruction of the macro-fused pair must be a CMP
  or TEST instruction. This instruction can be REG-REG, REG-IMM, or a
  micro-fused REG-MEM comparison. The second instruction (adjacent in the
  instruction stream) should be a conditional branch.

So if we need to insert NOPs, it is better to do it _before_ the CMP.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
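To illustrate note 2, a hand-written sketch (not compiler output) of where a padding NOP should and should not go:

        # bad: the NOP splits the CMP/Jcc pair, so it cannot macro-fuse
        cmpl    $4, %edi
        nop
        je      .L7

        # better: the pair stays adjacent and can still macro-fuse
        nop
        cmpl    $4, %edi
        je      .L7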
[Bug c/40093] New: Optimization by function reordering.
Because the memory controller prefetches memory blocks, the execution time of a sequence of function calls depends on the order of those functions in memory. For example, 4 calls:

        call func1
        call func2
        call func3
        call func4

are faster in the case of direct function order in memory:

        .p2align 4
func1:  ret
        .p2align 4
func2:  ret
        .p2align 4
func3:  ret
        .p2align 4
func4:  ret

and slower in the case of inverse order:

        .p2align 4
func4:  ret
        .p2align 4
func3:  ret
        .p2align 4
func2:  ret
        .p2align 4
func1:  ret

Unfortunately, inverse order is typical for C/C++. What do you think about this kind of optimization?

--
           Summary: Optimization by function reordering.
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40093
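One way to measure the effect, as a minimal sketch (my own illustration, not the attached tool; func1..func4 are assumed to be defined in a separate translation unit, as in the examples above):

#include <stdio.h>

extern void func1(void), func2(void), func3(void), func4(void);

/* read the CPU's time-stamp counter */
static unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    unsigned long long t0, t1;
    int i;

    t0 = rdtsc();
    for (i = 0; i < 100000; i++) {
        func1(); func2(); func3(); func4();
    }
    t1 = rdtsc();
    printf("%llu ticks\n", t1 - t0);
    return 0;
}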
[Bug c/40093] Optimization by function reordering.
--- Comment #1 from vvv at ru dot ru 2009-05-10 16:43 ---
Created an attachment (id=17847)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17847&action=view)
Example of direct/inverse calls

A simple example: RDTSC ticks for direct and inverse sequences of calls.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40093
[Bug middle-end/40093] Optimization by function reordering.
--- Comment #3 from vvv at ru dot ru 2009-05-10 18:08 ---
(In reply to comment #2)
> This should have been done already with cgraph order.

Unfortunately, I see the inverse order only with separate source files. Inverse, but not optimized. Example:

// file order1.c
#include <stdio.h>
main(int argc, char **argv)
{
  int i,j,k,l;
  i=func1();
  j=func2();
  k=func3();
  l=func4();
  printf("%d %d %d %d\n",i,j,k,l);
}

// file order2.c
int func1(){ return(F(4));}
int func2(){ return(F(3));}
int func3(){ return(F(2));}
int func4(){ return(F(1));}

// file order3.c
int F(int x){ return(x);}

# gcc --version
gcc (GCC) 4.5.0 20090508 (experimental)
# gcc -o order order3.c order2.c order1.c -O2
# objdump -d order

00400520 <F>:
  400520:  89 f8            mov    %edi,%eax
  400522:  c3               retq

00400530 <func4>:
  400530:  bf 01 00 00 00   mov    $0x1,%edi
  400535:  31 c0            xor    %eax,%eax
  400537:  e9 e4 ff ff ff   jmpq   400520 <F>

00400540 <func3>:
  400540:  bf 02 00 00 00   mov    $0x2,%edi
  400545:  31 c0            xor    %eax,%eax
  400547:  e9 d4 ff ff ff   jmpq   400520 <F>

00400550 <func2>:
  400550:  bf 03 00 00 00   mov    $0x3,%edi
  400555:  31 c0            xor    %eax,%eax
  400557:  e9 c4 ff ff ff   jmpq   400520 <F>

00400560 <func1>:
  400560:  bf 04 00 00 00   mov    $0x4,%edi
  400565:  31 c0            xor    %eax,%eax
  400567:  e9 b4 ff ff ff   jmpq   400520 <F>

00400570 <main>:
  400570:  48 89 5c 24 e8   mov    %rbx,-0x18(%rsp)
  400575:  48 89 6c 24 f0   mov    %rbp,-0x10(%rsp)
  40057a:  31 c0            xor    %eax,%eax
  40057c:  4c 89 64 24 f8   mov    %r12,-0x8(%rsp)
  400581:  48 83 ec 18      sub    $0x18,%rsp
  400585:  e8 d6 ff ff ff   callq  400560 <func1>
  40058a:  89 c3            mov    %eax,%ebx
  40058c:  31 c0            xor    %eax,%eax
  40058e:  e8 bd ff ff ff   callq  400550 <func2>
  400593:  89 c5            mov    %eax,%ebp
  400595:  31 c0            xor    %eax,%eax
  400597:  e8 a4 ff ff ff   callq  400540 <func3>
  40059c:  41 89 c4         mov    %eax,%r12d
  40059f:  31 c0            xor    %eax,%eax
  4005a1:  e8 8a ff ff ff   callq  400530 <func4>
  4005a6:  44 89 e1         mov    %r12d,%ecx
  4005a9:  41 89 c0         mov    %eax,%r8d
  4005ac:  89 ea            mov    %ebp,%edx
  4005ae:  89 de            mov    %ebx,%esi
  4005b0:  48 8b 6c 24 08   mov    0x8(%rsp),%rbp
  4005b5:  48 8b 1c 24      mov    (%rsp),%rbx
  4005b9:  4c 8b 64 24 10   mov    0x10(%rsp),%r12
  4005be:  bf bc 06 40 00   mov    $0x4006bc,%edi
  4005c3:  31 c0            xor    %eax,%eax
  4005c5:  48 83 c4 18      add    $0x18,%rsp
  4005c9:  e9 42 fe ff ff   jmpq   400410 <printf@plt>

But the optimal code would be:

00400520 <main>:
  400520:  48 89 5c 24 e8   mov    %rbx,-0x18(%rsp)
  400525:  48 89 6c 24 f0   mov    %rbp,-0x10(%rsp)
  40052a:  31 c0            xor    %eax,%eax
  40052c:  4c 89 64 24 f8   mov    %r12,-0x8(%rsp)
  400531:  48 83 ec 18      sub    $0x18,%rsp
  400535:  e8 46 00 00 00   callq  400580 <func1>
  40053a:  89 c3            mov    %eax,%ebx
  40053c:  31 c0            xor    %eax,%eax
  40053e:  e8 4d 00 00 00   callq  400590 <func2>
  400543:  89 c5            mov    %eax,%ebp
  400545:  31 c0            xor    %eax,%eax
  400547:  e8 54 00 00 00   callq  4005a0 <func3>
  40054c:  41 89 c4         mov    %eax,%r12d
  40054f:  31 c0            xor    %eax,%eax
  400551:  e8 5a 00 00 00   callq  4005b0 <func4>
  400556:  44 89 e1         mov    %r12d,%ecx
  400559:  41 89 c0         mov    %eax,%r8d
  40055c:  89 ea            mov    %ebp,%edx
  40055e:  89 de            mov    %ebx,%esi
  400560:  48 8b 6c 24 08   mov    0x8(%rsp),%rbp
  400565:  48 8b 1c 24      mov    (%rsp),%rbx
  400569:  4c 8b 64 24 10   mov    0x10(%rsp),%r12
  40056e:  bf bc 06 40 00   mov    $0x4006bc,%edi
  400573:  31 c0            xor    %eax,%eax
  400575:  48 83 c4 18      add    $0x18,%rsp
  400579:  e9 92 fe ff ff   jmpq   400410 <printf@plt>

00400580 <func1>:
  400580:  bf 01 00 00 00   mov    $0x1,%edi
  400585:  31 c0
[Bug middle-end/40093] Optimization by function reordering.
--- Comment #5 from vvv at ru dot ru 2009-05-10 18:20 ---
(In reply to comment #4)
> Well you need whole program to get the behavior which you want.

Yes. Of course, it's not a problem for a small single-programmer project, but it is a problem for big projects like the Linux kernel.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40093
[Bug c/40072] New: Nonoptimal code - CMOVxx %eax,%edi; mov %edi,%eax; retq
Sometimes GCC generates code like this at the end of a function:

   cmovge %eax,%edi
   mov    %edi,%eax
   retq

but this is faster:

   cmovl  %edi,%eax
   retq

Example:
# cat test.c
#define MX 0
#define LIM 7
char char_char(char m)                      {if(m>LIM) return(MX); return(m);}
char char_int(int m)                        {if(m>LIM) return(MX); return(m);}
char char_uint(unsigned int m)              {if(m>LIM) return(MX); return(m);}
char char_long(long m)                      {if(m>LIM) return(MX); return(m);}
char char_ulong(unsigned long m)            {if(m>LIM) return(MX); return(m);}
int int_char(char m)                        {if(m>LIM) return(MX); return(m);}
int int_int(int m)           // Nonoptimal  {if(m>LIM) return(MX); return(m);}
int int_uint(unsigned int m)                {if(m>LIM) return(MX); return(m);}
int int_long(long m)                        {if(m>LIM) return(MX); return(m);}
int int_ulong(unsigned long m)              {if(m>LIM) return(MX); return(m);}
unsigned int uint_char(char m)              {if(m>LIM) return(MX); return(m);}
unsigned int uint_int(int m)                {if(m>LIM) return(MX); return(m);}
unsigned int uint_uint(unsigned int m)      // Nonoptimal
                                            {if(m>LIM) return(MX); return(m);}
unsigned int uint_long(long m)              {if(m>LIM) return(MX); return(m);}
unsigned int uint_ulong(unsigned long m)    {if(m>LIM) return(MX); return(m);}
long long_char(char m)                      {if(m>LIM) return(MX); return(m);}
long long_int(int m)                        {if(m>LIM) return(MX); return(m);}
long long_uint(unsigned int m)              {if(m>LIM) return(MX); return(m);}
long long_long(long m)       // Nonoptimal  {if(m>LIM) return(MX); return(m);}
long long_ulong(unsigned long m)            {if(m>LIM) return(MX); return(m);}
unsigned long ulong_char(char m)            {if(m>LIM) return(MX); return(m);}
unsigned long ulong_int(int m)              {if(m>LIM) return(MX); return(m);}
unsigned long ulong_uint(unsigned int m)    {if(m>LIM) return(MX); return(m);}
unsigned long ulong_long(long m)            {if(m>LIM) return(MX); return(m);}
unsigned long ulong_ulong(unsigned long m)  // Nonoptimal
                                            {if(m>LIM) return(MX); return(m);}

# gcc -o t test.c -O2 -c
# objdump -d t

t:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <char_char>:
   0:  89 f8            mov    %edi,%eax
   2:  40 80 ff 08      cmp    $0x8,%dil
   6:  ba 00 00 00 00   mov    $0x0,%edx
   b:  0f 4d c2         cmovge %edx,%eax    <--- It's OK! Optimal
   e:  c3               retq
   f:  90               nop

skip...

0000000000000060 <int_int>:
  60:  83 ff 08         cmp    $0x8,%edi
  63:  b8 00 00 00 00   mov    $0x0,%eax
  68:  0f 4d f8         cmovge %eax,%edi    <--- Nonoptimal
  6b:  89 f8            mov    %edi,%eax    <--- Nonoptimal
  6d:  c3               retq
  6e:  66 90            xchg   %ax,%ax

skip...

00000000000000c0 <uint_uint>:
  c0:  83 ff 08         cmp    $0x8,%edi
  c3:  b8 00 00 00 00   mov    $0x0,%eax
  c8:  0f 43 f8         cmovae %eax,%edi    <--- Nonoptimal
  cb:  89 f8            mov    %edi,%eax    <--- Nonoptimal
  cd:  c3               retq
  ce:  66 90            xchg   %ax,%ax

skip...

0000000000000120 <long_long>:
 120:  48 83 ff 08      cmp    $0x8,%rdi
 124:  b8 00 00 00 00   mov    $0x0,%eax
 129:  48 0f 4d f8      cmovge %rax,%rdi    <--- Nonoptimal
 12d:  48 89 f8         mov    %rdi,%rax    <--- Nonoptimal
 130:  c3               retq

skip...

0000000000000190 <ulong_ulong>:
 190:  48 83 ff 08      cmp    $0x8,%rdi
 194:  b8 00 00 00 00   mov    $0x0,%eax
 199:  48 0f 43 f8      cmovae %rax,%rdi    <--- Nonoptimal
 19d:  48 89 f8         mov    %rdi,%rax    <--- Nonoptimal
 1a0:  c3               retq

--
           Summary: Nonoptimal code - CMOVxx %eax,%edi; mov %edi,%eax; retq
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40072
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #11 from vvv at ru dot ru 2009-04-29 07:46 ---
(In reply to comment #8)
> From config/i386/i386.c:
> /* AMD Athlon works faster
>    when RET is not destination of conditional jump or directly preceded
>    by other jump instruction.  We avoid the penalty by inserting NOP just
>    before the RET instructions in such cases.  */
> static void ix86_pad_returns (void)
> ...

But I am using a Core 2 Duo. Why do we see a multibyte NOP, not a single-byte NOP? And why does the number of NOPs change if I change the line u = F(u)*3+1; to u = F(u)*4+1; or u = F(u);?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #12 from vvv at ru dot ru 2009-04-29 07:55 ---
(In reply to comment #9)
> So that explains it. Use -Os or attribute cold if you want NOPs to be gone.

But my measurements on a Core 2 Duo P8600 show that

   push %ebp
   mov  %esp,%ebp
   leave
   ret

is _faster_ than

   push %ebp
   mov  %esp,%ebp
   leave
   xchg %ax,%ax
   ret

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #15 from vvv at ru dot ru 2009-04-29 19:16 ---
One more example: a 5-byte NOP between leaveq and retq.

# cat test.c
void wait_for_enter()
{
  int u = getchar();
  while (!u)
    u = getchar()-13;
}
main()
{
  wait_for_enter();
}
# gcc -o t.out test.c -O2 -march=core2 -fno-omit-frame-pointer
# objdump -d t.out
...
00400540 <wait_for_enter>:
  400540:  55                 push   %rbp
  400541:  31 c0              xor    %eax,%eax
  400543:  48 89 e5           mov    %rsp,%rbp
  400546:  e8 f5 fe ff ff     callq  400440 <getchar@plt>
  40054b:  85 c0              test   %eax,%eax
  40054d:  75 13              jne    400562 <wait_for_enter+0x22>
  40054f:  90                 nop
  400550:  31 c0              xor    %eax,%eax
  400552:  e8 e9 fe ff ff     callq  400440 <getchar@plt>
  400557:  83 f8 0d           cmp    $0xd,%eax
  40055a:  66 0f 1f 44 00 00  nopw   0x0(%rax,%rax,1)
  400560:  74 ee              je     400550 <wait_for_enter+0x10>
  400562:  c9                 leaveq
  400563:  0f 1f 44 00 00     nopl   0x0(%rax,%rax,1)    <--- NONOPTIMAL!
  400568:  c3                 retq
  400569:  0f 1f 80 00 00 00 00  nopl 0x0(%rax)

00400570 <main>:
  400570:  55                 push   %rbp
  400571:  31 c0              xor    %eax,%eax
  400573:  48 89 e5           mov    %rsp,%rbp
  400576:  e8 c5 ff ff ff     callq  400540 <wait_for_enter>
  40057b:  c9                 leaveq
  40057c:  c3                 retq
  40057d:  90                 nop
  40057e:  90                 nop
  40057f:  90                 nop

So the bug is unresolved.

--
vvv at ru dot ru changed:

           What    |Removed     |Added
------------------------------------------------------------
             Status|RESOLVED    |UNCONFIRMED
         Resolution|INVALID     |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq
Sometimes we can see a 2-byte NOP (xchg %ax,%ax) between leaveq and retq. IMHO, it is better to remove the xchg %ax,%ax.

Examples from kernel 2.6.29.1:

# gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
# objdump -d vmlinux
...
804262e0 <set_blitting_type>:
804262e0:  55                       push   %rbp
804262e1:  0f b7 07                 movzwl (%rdi),%eax
804262e4:  4c 8b 86 d0 03 00 00     mov    0x3d0(%rsi),%r8
804262eb:  48 c1 e0 07              shl    $0x7,%rax
804262ef:  48 89 e5                 mov    %rsp,%rbp
804262f2:  48 05 40 1f 9c 80        add    $0x809c1f40,%rax
804262f8:  49 89 80 90 01 00 00     mov    %rax,0x190(%r8)
804262ff:  8b 46 04                 mov    0x4(%rsi),%eax
80426302:  89 c1                    mov    %eax,%ecx
80426304:  81 e1 00 00 02 00        and    $0x20000,%ecx
8042630a:  75 2c                    jne    80426338 <set_blitting_type+0x58>
8042630c:  48 8b 86 d0 03 00 00     mov    0x3d0(%rsi),%rax
80426313:  4c 89 c7                 mov    %r8,%rdi
80426316:  48 8b 90 90 01 00 00     mov    0x190(%rax),%rdx
8042631d:  8b 52 1c                 mov    0x1c(%rdx),%edx
80426320:  83 fa 03                 cmp    $0x3,%edx
80426323:  0f 4e ca                 cmovle %edx,%ecx
80426326:  89 88 b0 01 00 00        mov    %ecx,0x1b0(%rax)
8042632c:  e8 2f 4d 00 00           callq  8042b060 <fbcon_set_bitops>
80426331:  c9                       leaveq
80426332:  c3                       retq
80426333:  0f 1f 44 00 00           nopl   0x0(%rax,%rax,1)
80426338:  e8 23 61 00 00           callq  8042c460 <fbcon_set_tileops>
8042633d:  c9                       leaveq
8042633e:  66 90                    xchg   %ax,%ax
80426340:  c3                       retq
...
8042b060 <fbcon_set_bitops>:
8042b060:  55                       push   %rbp
8042b061:  48 c7 07 d0 ad 42 80     movq   $0x8042add0,(%rdi)
8042b068:  8b 87 b0 01 00 00        mov    0x1b0(%rdi),%eax
8042b06e:  48 89 e5                 mov    %rsp,%rbp
8042b071:  48 c7 47 08 30 ae 42 80  movq   $0x8042ae30,0x8(%rdi)
8042b079:  48 c7 47 10 c0 b7 42 80  movq   $0x8042b7c0,0x10(%rdi)
8042b081:  48 c7 47 18 10 af 42 80  movq   $0x8042af10,0x18(%rdi)
8042b089:  48 c7 47 20 10 b1 42 80  movq   $0x8042b110,0x20(%rdi)
8042b091:  48 c7 47 28 c0 b0 42 80  movq   $0x8042b0c0,0x28(%rdi)
8042b099:  48 c7 47 30 00 00 00 00  movq   $0x0,0x30(%rdi)
8042b0a1:  85 c0                    test   %eax,%eax
8042b0a3:  75 0b                    jne    8042b0b0 <fbcon_set_bitops+0x50>
8042b0a5:  c9                       leaveq
8042b0a6:  c3                       retq
8042b0a7:  66 0f 1f 84 00 00 00 00 00  nopw 0x0(%rax,%rax,1)
8042b0b0:  e8 4b 15 00 00           callq  8042c600 <fbcon_set_rotate>
8042b0b5:  c9                       leaveq
8042b0b6:  66 90                    xchg   %ax,%ax
8042b0b8:  c3                       retq

--
           Summary: Nonoptimal code - leaveq; xchg %ax,%ax; retq
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #2 from vvv at ru dot ru 2009-04-28 17:04 ---
Created an attachment (id=17776)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux kernel 2.6.29.1

See static void set_blitting_type.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #3 from vvv at ru dot ru 2009-04-28 17:10 ---
Additional examples from Linux kernel 2.6.29.1:
(Note the conditional statement at the end of all these functions!)

=============================================
linux/drivers/video/console/bitblit.c

void fbcon_set_bitops(struct fbcon_ops *ops)
{
        ops->bmove = bit_bmove;
        ops->clear = bit_clear;
        ops->putcs = bit_putcs;
        ops->clear_margins = bit_clear_margins;
        ops->cursor = bit_cursor;
        ops->update_start = bit_update_start;
        ops->rotate_font = NULL;

        if (ops->rotate)
                fbcon_set_rotate(ops);
}

8020a5e0 <disable_TSC>:
8020a5e0:  55                          push   %rbp
8020a5e1:  bf 01 00 00 00              mov    $0x1,%edi
8020a5e6:  48 89 e5                    mov    %rsp,%rbp
8020a5e9:  e8 c2 fd 35 00              callq  8056a3b0 <add_preempt_count>
8020a5ee:  65 48 8b 04 25 10 00 00 00  mov    %gs:0x10,%rax
8020a5f7:  48 2d c8 1f 00 00           sub    $0x1fc8,%rax
8020a5fd:  f0 0f ba 28 10              lock btsl $0x10,(%rax)
8020a602:  19 d2                       sbb    %edx,%edx
8020a604:  85 d2                       test   %edx,%edx
8020a606:  75 0a                       jne    8020a612 <disable_TSC+0x32>
8020a608:  0f 20 e0                    mov    %cr4,%rax
8020a60b:  48 83 c8 04                 or     $0x4,%rax
8020a60f:  0f 22 e0                    mov    %rax,%cr4
8020a612:  bf 01 00 00 00              mov    $0x1,%edi
8020a617:  e8 e4 fc 35 00              callq  8056a300 <sub_preempt_count>
8020a61c:  65 48 8b 04 25 10 00 00 00  mov    %gs:0x10,%rax
8020a625:  f6 80 38 e0 ff ff 08        testb  $0x8,-0x1fc8(%rax)
8020a62c:  75 02                       jne    8020a630 <disable_TSC+0x50>
8020a62e:  c9                          leaveq
8020a62f:  c3                          retq
8020a630:  e8 2b 99 35 00              callq  80563f60 <preempt_schedule>
8020a635:  c9                          leaveq
8020a636:  66 90                       xchg   %ax,%ax
8020a638:  c3                          retq

=============================================
arch/x86/kernel/io_delay.c

void native_io_delay(void)
{
        switch (io_delay_type) {
        default:
        case CONFIG_IO_DELAY_TYPE_0X80:
                asm volatile ("outb %al, $0x80");
                break;
        case CONFIG_IO_DELAY_TYPE_0XED:
                asm volatile ("outb %al, $0xed");
                break;
        case CONFIG_IO_DELAY_TYPE_UDELAY:
                /*
                 * 2 usecs is an upper-bound for the outb delay but
                 * note that udelay doesn't have the bus-level
                 * side-effects that outb does, nor does udelay() have
                 * precise timings during very early bootup (the delays
                 * are shorter until calibrated):
                 */
                udelay(2);
        case CONFIG_IO_DELAY_TYPE_NONE:
                break;
        }
}
EXPORT_SYMBOL(native_io_delay);

802131e0 <native_io_delay>:
802131e0:  55                    push   %rbp
802131e1:  8b 05 3d b3 54 00     mov    0x54b33d(%rip),%eax   # 8075e524 <io_delay_type>
802131e7:  48 89 e5              mov    %rsp,%rbp
802131ea:  83 f8 02              cmp    $0x2,%eax
802131ed:  74 29                 je     80213218 <native_io_delay+0x38>
802131ef:  83 f8 03              cmp    $0x3,%eax
802131f2:  74 06                 je     802131fa <native_io_delay+0x1a>
802131f4:  ff c8                 dec    %eax
802131f6:  74 10                 je     80213208 <native_io_delay+0x28>
802131f8:  e6 80                 out    %al,$0x80
802131fa:  c9                    leaveq
802131fb:  0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
80213200:  c3                    retq
80213201:  0f 1f 80 00 00 00 00  nopl   0x0(%rax)
80213208:  e6 ed                 out    %al,$0xed
8021320a:  c9                    leaveq
8021320b:  0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
80213210:  c3                    retq
80213211:  0f 1f 80 00 00 00 00  nopl   0x0(%rax)
80213218:  bf 8e 21 00 00        mov    $0x218e,%edi
8021321d:  0f 1f 00              nopl   (%rax)
80213220:  e8 fb ac 1e 00        callq  803fdf20 <__const_udelay>
80213225:  c9                    leaveq
80213226:  66 90                 xchg   %ax,%ax
80213228:  c3                    retq
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #4 from vvv at ru dot ru 2009-04-28 17:15 ---
Created an attachment (id=1)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view)
Simple example from Linux

See the two functions: static void pre_schedule_rt and static void switched_from_rt.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #6 from vvv at ru dot ru 2009-04-28 21:18 ---
Let's compile the file test.c:

// file test.c
extern int F(int m);
void func(int x)
{
  int u = F(x);
  while (u)
    u = F(u)*3+1;
}

# gcc -o t.out test.c -c -O2
# objdump -d t.out

t.out:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <func>:
   0:  48 83 ec 08            sub    $0x8,%rsp
   4:  e8 00 00 00 00         callq  9 <func+0x9>
   9:  85 c0                  test   %eax,%eax
   b:  89 c7                  mov    %eax,%edi
   d:  74 0e                  je     1d <func+0x1d>
   f:  90                     nop
  10:  e8 00 00 00 00         callq  15 <func+0x15>
  15:  8d 7c 40 01            lea    0x1(%rax,%rax,2),%edi
  19:  85 ff                  test   %edi,%edi
  1b:  75 f3                  jne    10 <func+0x10>
  1d:  48 83 c4 08            add    $0x8,%rsp
  21:  0f 1f 80 00 00 00 00   nopl   0x0(%rax)    <--- nonoptimal
  28:  c3                     retq

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug c/39549] New: Nonoptimal byte load. mov (%rdi),%al better than movzbl (%rdi),%eax
# gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
# cat test.c
// file test.c  One byte transfer
void f(char *a,char *b){ *b=*a; }
void F(char *a,char *b){ asm volatile("mov (%rdi),%al\nmov %al,(%rsi)"); }
...
# gcc -g -otest test.c -O2 -mtune=core2
# objdump -d test

004004f0 <f>:
  4004f0:  0f b6 07   movzbl (%rdi),%eax
  4004f3:  88 06      mov    %al,(%rsi)
  4004f5:  c3         retq
  4004f6:  66 2e 0f 1f 84 00 00 00 00 00  nopw %cs:0x0(%rax,%rax,1)

00400500 <F>:
  400500:  8a 07      mov    (%rdi),%al
  400502:  88 06      mov    %al,(%rsi)
  400504:  c3         retq

GCC uses movzbl (%rdi),%eax, but it is better to use mov (%rdi),%al, because the latter instruction is 1 byte shorter. Execution time is the same (at least on Core 2 Duo and Core 2 Solo).
This is probably a result of Intel's recommendation to use movz to avoid a partial register stall. But the smaller instruction reduces fetch bandwidth... and quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual 248966:

  3.5.2.3 Partial Register Stalls
  The delay of a partial register stall is small in processors based on Intel
  Core and NetBurst microarchitectures, and in Pentium M processor (with
  CPUID signature family 6, model 13), Intel Core Solo, and Intel Core Duo
  processors. Pentium M processors (CPUID signature with family 6, model 9)
  and the P6 family incur a large penalty.

--
           Summary: Nonoptimal byte load. mov (%rdi),%al better than
                    movzbl (%rdi),%eax
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39549
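For context, the stall that the movz recommendation guards against looks like this (my own sketch, not from the manual); it only matters when a later instruction reads the full register after a partial write:

   mov  (%rdi),%al    # writes only the low 8 bits of %rax
   add  %eax,%ecx     # reads the whole %eax -> partial-register stall
                      # (large penalty on P6-family, small on Core/NetBurst)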
[Bug c/39520] New: Empty function translated to repz retq.
# gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
# cat test.c
// file test.c  Call to empty function
void f(){ }
int main(){ return(0);}
# gcc -o test test.c -O2
# objdump -d test

004004f0 <f>:
  4004f0:  f3 c3      repz retq

Why rep ret? Why not ret?

--
           Summary: Empty function translated to repz retq.
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39520