[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Peter Cordes changed:
           What    |Removed |Added
                 CC|        |peter at cordes dot ca

--- Comment #53 from Peter Cordes ---
I think we can close this as fixed at some point. The last activity on this bug was some patches that sound like they were intended to fix it, and the MCVEs from the comments that I tested no longer have the problem.

GCC 9.3 -O3 -march=core2 -fomit-frame-pointer only uses a `.p2align` to align the top of the loop, not between leave and ret or between cmp/jcc.

void wait_for_enter() {
    volatile int foo = 0;   // to get a LEAVE instruction emitted at all
    int u = getchar();
    while (!u)
        u = getchar() - 13;
}

https://godbolt.org/z/RvxzZv

(Note that Godbolt normally filters .p2align, so you have to either compile to binary or not filter directives in the asm source. Otherwise you'll never see NOPs except in the unusual case where GCC actually emits a nop mnemonic.)
--- Comment #52 from jakub at gcc dot gnu dot org 2009-05-21 13:26 ---
Subject: Bug 39942

Author: jakub
Date: Thu May 21 13:26:13 2009
New Revision: 147766

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147766
Log:
	PR target/39942
	* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit
	second .p2align 3 if MAX_SKIP is smaller than 7.
	* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
	branches/gcc-4_3-branch/gcc/ChangeLog
	branches/gcc-4_3-branch/gcc/config/i386/linux.h
	branches/gcc-4_3-branch/gcc/config/i386/x86-64.h

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #51 from jakub at gcc dot gnu dot org 2009-05-21 13:21 ---
Subject: Bug 39942

Author: jakub
Date: Thu May 21 13:21:30 2009
New Revision: 147765

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147765
Log:
	PR target/39942
	* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit
	second .p2align 3 if MAX_SKIP is smaller than 7.
	* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
	branches/gcc-4_4-branch/gcc/ChangeLog
	branches/gcc-4_4-branch/gcc/config/i386/linux.h
	branches/gcc-4_4-branch/gcc/config/i386/x86-64.h

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #50 from jakub at gcc dot gnu dot org 2009-05-20 22:09 ---
Aren't nopl 0x0(%rax,%rax,1) and nopw 0x0(%rax,%rax,1) padding (though added in this case for label alignment or function entry alignment, not to avoid 4+ jumps in one 16-byte page)? Anyway, you want to look at both the -S output and objdump -d of the -c output; there you'll see the needed .p2align added.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #49 from vvv at ru dot ru 2009-05-20 21:38 ---
(In reply to comment #48)
How do these patches work? Do they require some special options?

# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
    if (i == 1) F(1);
    if (i == 2) F(2);
    if (i == 3) F(3);
    if (i == 4) F(4);
    if (i == 5) F(5);
}
extern int F(int m);
void func(int x)
{
    int u = F(x);
    while (u)
        u = F(u)*3+1;
}
# /media/disk-1/B/bin/gcc -o t test.c -O2 -c -mtune=k8
# objdump -d t

<f>:
   0:	83 ff 01             	cmp    $0x1,%edi
   3:	74 1b                	je     20
   5:	83 ff 02             	cmp    $0x2,%edi
   8:	74 16                	je     20
   a:	83 ff 03             	cmp    $0x3,%edi
   d:	74 11                	je     20
   f:	83 ff 04             	cmp    $0x4,%edi
  12:	74 0c                	je     20
  14:	83 ff 05             	cmp    $0x5,%edi
  17:	74 07                	je     20
  19:	f3 c3                	repz retq
  1b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  20:	31 c0                	xor    %eax,%eax
  22:	e9 00 00 00 00       	jmpq   27
  27:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  2e:	00 00 00

30 <func>:
  30:	48 83 ec 08          	sub    $0x8,%rsp
  34:	e8 00 00 00 00       	callq  39
  39:	85 c0                	test   %eax,%eax
  3b:	89 c7                	mov    %eax,%edi
  3d:	74 0e                	je     4d
  3f:	90                   	nop
  40:	e8 00 00 00 00       	callq  45
  45:	8d 7c 40 01          	lea    0x1(%rax,%rax,2),%edi
  49:	85 ff                	test   %edi,%edi
  4b:	75 f3                	jne    40
  4d:	48 83 c4 08          	add    $0x8,%rsp
  51:	c3                   	retq

I can't see any padding in function f :(

PS. In file config/i386/i386.c (ix86_avoid_jump_mispredicts):
  /* Look for all minimal intervals of instructions containing 4 jumps.  ...
Not jumps, but _branches_ (CALL, JMP, conditional branches, or returns).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #48 from hjl at gcc dot gnu dot org 2009-05-18 17:21 ---
Subject: Bug 39942

Author: hjl
Date: Mon May 18 17:21:13 2009
New Revision: 147671

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147671
Log:
2009-05-18  H.J. Lu

	PR target/39942
	* config/i386/i386.c (ix86_avoid_jump_misspredicts): Replace
	gen_align with gen_pad.
	(ix86_reorg): Check ASM_OUTPUT_MAX_SKIP_PAD instead of
	#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN.
	* config/i386/i386.h (ASM_OUTPUT_MAX_SKIP_PAD): New.
	* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_PAD): Likewise.
	* config/i386/i386.md (align): Renamed to ...
	(pad): This.  Replace ASM_OUTPUT_MAX_SKIP_ALIGN with
	ASM_OUTPUT_MAX_SKIP_PAD.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386.c
	trunk/gcc/config/i386/i386.h
	trunk/gcc/config/i386/i386.md
	trunk/gcc/config/i386/x86-64.h

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #47 from jakub at gcc dot gnu dot org 2009-05-16 07:12 ---
Subject: Bug 39942

Author: jakub
Date: Sat May 16 07:12:02 2009
New Revision: 147607

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147607
Log:
	PR target/39942
	* final.c (label_to_max_skip): New function.
	(label_to_alignment): Only use LABEL_TO_ALIGNMENT if
	CODE_LABEL_NUMBER <= max_labelno.
	* output.h (label_to_max_skip): New prototype.
	* config/i386/i386.c (ix86_avoid_jump_misspredicts): Renamed to...
	(ix86_avoid_jump_mispredicts): ... this.  Don't define if
	ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined.  Update comment.
	Handle CODE_LABELs with >= 16 byte alignment or with
	max_skip == (1 << align) - 1.
	(ix86_reorg): Don't call ix86_avoid_jump_mispredicts if
	ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386.c
	trunk/gcc/final.c
	trunk/gcc/output.h

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #46 from jakub at gcc dot gnu dot org 2009-05-16 07:10 ---
Subject: Bug 39942

Author: jakub
Date: Sat May 16 07:09:52 2009
New Revision: 147606

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147606
Log:
	PR target/39942
	* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit
	second .p2align 3 if MAX_SKIP is smaller than 7.
	* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/linux.h
	trunk/gcc/config/i386/x86-64.h

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #45 from jakub at gcc dot gnu dot org 2009-05-16 06:37 ---
cmpl $1, %eax does have the modrm byte:
  83 f8 01             	cmp    $0x1,%eax
compared to cmpl $0xdeadbeef, %eax, which doesn't have it:
  3d ef be ad de       	cmp    $0xdeadbeef,%eax
So I think what I wrote is more precise. The modrm byte is there if the insn has an ax_reg_operand destination and an immediate source which has been shortened.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #44 from hjl dot tools at gmail dot com 2009-05-15 23:05 ---
(In reply to comment #41)
> The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made
> me look at one of the cases which was wrong and the problem is that
> cmp $0x1d, %al has too large get_attr_length (insn) returned, 3 instead of
> 2, because GCC thinks it has a modrm byte when it has not.
> Testing:
> --- gcc/config/i386/i386.md.jj	2009-05-13 08:42:51.0 +0200
> +++ gcc/config/i386/i386.md	2009-05-15 18:06:40.0 +0200
> @@ -504,6 +504,9 @@
>  	 (and (eq_attr "type" "callv")
>  	      (match_operand 1 "constant_call_address_operand" ""))
>  	   (const_int 0)
> +	 (and (eq_attr "type" "alu,alu1,icmp,test")
> +	      (match_operand 0 "ax_reg_operand" ""))
> +	   (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) != MODE_QI))")
>  	]
>  	(const_int 1)))

"cmp imm,%al/%ax/%eax/%rax" doesn't have the modrm byte. I think this patch works better:

--- i386.md.branch	2009-05-15 11:30:42.0 -0700
+++ i386.md	2009-05-15 14:44:11.0 -0700
@@ -504,6 +504,10 @@
 	 (and (eq_attr "type" "callv")
 	      (match_operand 1 "constant_call_address_operand" ""))
 	   (const_int 0)
+	 (and (eq_attr "type" "alu,alu1,icmp,test")
+	      (match_operand 0 "ax_reg_operand" "")
+	      (match_operand 1 "immediate_operand" ""))
+	   (const_int 0)
 	]
 	(const_int 1)))

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #43 from jakub at gcc dot gnu dot org 2009-05-15 18:23 --- Some code size growth is from enlarged get_attr_modrm though, 292 bytes for 64-bit, 1338 bytes for 32-bit. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #42 from jakub at gcc dot gnu dot org 2009-05-15 18:18 --- Sizes with the #c41 patch together with the 3 patches mentioned in #c31 are: 0x890038 (64-bit) and 0x8ce08c (32-bit), 44 bad 16-byte pages in 64-bit, 35 in 32-bit. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #41 from jakub at gcc dot gnu dot org 2009-05-15 16:24 ---
The 34 resp. 51 4-branches-in-16-byte-page counts with the 3 patches together made me look at one of the cases which was wrong, and the problem is that cmp $0x1d, %al has too large a get_attr_length (insn) returned, 3 instead of 2, because GCC thinks it has a modrm byte when it has not.
Testing:

--- gcc/config/i386/i386.md.jj	2009-05-13 08:42:51.0 +0200
+++ gcc/config/i386/i386.md	2009-05-15 18:06:40.0 +0200
@@ -504,6 +504,9 @@
 	 (and (eq_attr "type" "callv")
 	      (match_operand 1 "constant_call_address_operand" ""))
 	   (const_int 0)
+	 (and (eq_attr "type" "alu,alu1,icmp,test")
+	      (match_operand 0 "ax_reg_operand" ""))
+	   (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) != MODE_QI))")
 	]
 	(const_int 1)))

now on top of the 3 patches (without the 4th) to see what it does to code size and the number of 4+ branches in a 16-byte page.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #40 from hjl dot tools at gmail dot com 2009-05-15 14:35 ---
(In reply to comment #37)
> This patch looks very wrong. It assumes that min_insn_size gives exact insn
> sizes (current min_insn_size is very far from that, but even get_attr_length
> isn't exact), doesn't take into account label alignments nor branch
> shortening which can change the insn sizes afterwards, and assumes that a
> p2align always aligns to 16 bytes (it does not).
> While the previous algorithm works with estimated 16 consecutive bytes
> rather than 16 byte pages 0xXXX0 ... 0xXXXF, that's because during machine
> reorg you simply can't know in most cases where exactly the 16 byte page
> will start, so you assume it can start (almost) anywhere (and use .p2align
> behavior to align when needed).

There is no perfect solution here. Let's list pros/cons.

The current algorithm:
Pros:
1. Very conservative; catches most cases of 4 branches in a 16-byte window.
Cons:
1. It works on a 16-byte window, not a 16-byte page.
2. When it gets it wrong, it increases code size by adding unnecessary nops.

My proposal:
Pros:
1. Works on a 16-byte page.
2. Even if it gets it wrong, it doesn't increase code size.
Cons:
1. Relies on inaccurate instruction length data.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #39 from jakub at gcc dot gnu dot org 2009-05-15 12:12 --- Created an attachment (id=17874) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17874&action=view) test4jmp.sh -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #38 from jakub at gcc dot gnu dot org 2009-05-15 12:11 --- To extend #c31, I've also built the same tree with another patch which made sure ix86_avoid_jump_mispredicts is never called (change "&& optimize" into "&& optimize > 4" in ix86_reorg). cc1plus sizes were then 0x88d6d8 bytes for 64-bit cc1plus and 0x8c980c bytes for 32-bit cc1plus. That is 2.42% smaller than vanilla resp. 2.06%. I've also changed my testing script, so that it (hopefully, point me to errors) follows the AMD rules more closely (namely that the last byte in the branch insn counts, not the first one), will attach. With that script, I got following number of violations for 64-bit cc1plus (vanilla, +1patch, +2patches, +3patches, +4patches): 6, 7, 6, 51, 138 and for 32-bit cc1plus: 1, 2, 3, 34, 159. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #37 from jakub at gcc dot gnu dot org 2009-05-15 07:56 --- This patch looks very wrong. It assumes that min_insn_size gives exact insn sizes (current min_insn_size is very far from that, but even get_attr_length isn't exact), doesn't take into account label alignments nor branch shortening which can change the insn sizes afterwards and assumes that a p2align always aligns to 16 bytes (it does not). While the previous algorithm works with estimated 16 consecutive bytes rather than 16 byte pages 0x0 ... 0xXXXF, that's because during machine reorg you simply can't know in most cases where exactly the 16 byte page will start, so you assume it can start (almost) anywhere (and use .p2align behavior to align when needed). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #36 from hjl dot tools at gmail dot com 2009-05-15 04:32 ---
Created an attachment (id=17871)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17871&action=view)
An updated patch

A few comments:
1. The 3-branch limit is per 16-byte page, not per 16-byte window.
2. We should allow 3 branches in a 16-byte page.
3. When we have 4 branches in a 16-byte page, we only need to pad to the 16-byte page boundary before the 4th branch, which will then start at the next 16-byte page.

--
hjl dot tools at gmail dot com changed:
           What              |Removed |Added
  Attachment #17870 is       |0       |1
  obsolete                   |        |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #35 from hjl dot tools at gmail dot com 2009-05-15 02:23 --- Created an attachment (id=17870) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17870&action=view) A patch This patch limits 3 branches per 16byte page. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #34 from vvv at ru dot ru 2009-05-14 19:43 ---
(In reply to comment #32)
> Please make sure that you only test nop paddings for branch insns,
> not nop paddings for branch targets, which prefer 16byte alignment.

Additional test results (for Core2):
1. Execution time doesn't depend on padding of branch targets.
2. Execution time doesn't depend on the position of the NOP within a 16-byte chunk with 4 branches, even if the NOP is inserted between CMP and the conditional jump.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #33 from hjl dot tools at gmail dot com 2009-05-14 18:37 ---
(In reply to comment #20)
> Instruction decoders generally operate on whole cache-lines, so 16-byte
> chunk very very likely refers to a cache-line.

That is true. For Intel CPUs, a "16-byte chunk" means the memory range XXX0 - XXXF.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #32 from hjl dot tools at gmail dot com 2009-05-14 15:58 ---
(In reply to comment #30)
> Created an attachment (id=17863)
> --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
> Testing tool.

Please make sure that you only test nop paddings for branch insns, not nop paddings for branch targets, which prefer 16-byte alignment.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #31 from jakub at gcc dot gnu dot org 2009-05-14 15:15 ---
Some -O2 code size data from today's trunk bootstraps. The first .text line is always the vanilla bootstrap, the second one with http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html only, the third one with that patch and http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html, and the last with additionally:

--- i386.c.jj3	2009-05-14 12:41:24.0 +0200
+++ i386.c	2009-05-14 14:48:24.0 +0200
@@ -27202,7 +27202,7 @@ x86_function_profiler (FILE *file, int l
 static int
 min_insn_size (rtx insn)
 {
-  int l = 0;
+  int l = 0, len;

   if (!INSN_P (insn) || !active_insn_p (insn))
     return 0;
@@ -27222,7 +27222,8 @@ min_insn_size (rtx insn)
       && symbolic_reference_mentioned_p (PATTERN (insn))
       && !SIBLING_CALL_P (insn))
     return 5;
-  if (get_attr_length (insn) <= 1)
+  len = get_attr_length (insn);
+  if (len <= 1)
     return 1;

   /* For normal instructions we may rely on the sizes of addresses
@@ -27230,6 +27231,9 @@ min_insn_size (rtx insn)
      This is not the case for jumps where references are PC relative.  */
   if (!JUMP_P (insn))
     {
+      if (get_attr_type (insn) != TYPE_MULTI)
+	return len;
+
       l = get_attr_length_address (insn);
       if (l < 4 && symbolic_reference_mentioned_p (PATTERN (insn)))
 	l = 4;

to see how the code size changes with much more accurate (though sometimes not minimum but maximum bound) insn sizing for the algorithm.

64-bit cc1plus:
  [12] .text  PROGBITS  0047f990 07f990 8c3ba8 00  AX  0  0 16
  [12] .text  PROGBITS  0047f990 07f990 89b1e8 00  AX  0  0 16
  [12] .text  PROGBITS  0047f9c0 07f9c0 899f78 00  AX  0  0 16
  [12] .text  PROGBITS  0047f9c0 07f9c0 88eaf8 00  AX  0  0 16
32-bit cc1plus:
  [12] .text  PROGBITS  080b24e0 06a4e0 8f8cac 00  AX  0  0 16
  [12] .text  PROGBITS  080b24e0 06a4e0 8d516c 00  AX  0  0 16
  [12] .text  PROGBITS  080b2510 06a510 8d507c 00  AX  0  0 16
  [12] .text  PROGBITS  080b2510 06a510 8cbd7c 00  AX  0  0 16

For 64-bit cc1plus that's a 1.8%, 1.86%, resp. 2.36% smaller binary with the 1, 2, resp. 3 patches; for 32-bit cc1plus a 1.55%, 1.56%, resp. 1.96% smaller binary.
So the first patch is the most important and something like the third one, perhaps with more exceptions, also makes a difference. I'll now try to update my awk script to check for the AMD rules, namely that the last byte of the branch insn counts rather than the first. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #30 from vvv at ru dot ru 2009-05-14 09:01 ---
Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.

Here are the results of my testing. Code:

        align 128
test_cikl:
        rept 14 { nop }      ; 14 if SH=0, 15 if SH=1, 16 if SH=2
        cmp al,0             ; 2 bytes
        jz  $+10h+NOPS       ; 2 bytes, offset=0
        cmp al,1             ; 2 bytes, offset=2
        jz  $+0Ch+NOPS       ; 2 bytes, offset=4
        cmp al,2             ; 2 bytes, offset=6
        jz  $+08h+NOPS       ; 2 bytes, offset=8
        cmp al,3             ; 2 bytes, offset=A
        match =1, NOPS { nop }
        match =2, NOPS { xchg eax,eax }  ; 2-byte NOP
        jz  $+04h            ; 2 bytes, offset=C
        ja  $+02h            ; 2 bytes, offset=E
        mov eax,ecx
        and eax,7h
        loop test_cikl

This code was tested on Core2, Xeon and P4 CPUs. Results in RDTSC ticks:

; Core 2 Duo
;      NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0   0/571/729       1/306/594       2/315/630
; SH=1   0/338/612       1/338/648       2/339/648
; SH=2   0/339/666       1/339/675       2/333/693
; Xeon 3110
;      NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0   0/586/693       1/310/675       2/310/675
; SH=1   0/333/657       1/330/648       2/464/630
; SH=2   0/333/657       1/470/594       2/474/603
; P4
;      NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0   0/1027/1317     1/1094/1258     2/1028/1207
; SH=1   0/1151/1377     1/1068/1352     2/902/1275
; SH=2   0/1124/1275     1/1148/1335     2/979/1139

Conclusions:
1. Core2 and Xeon show similar results; the P4 results are something strange. For Core2 & Xeon, padding is very effective: code with padding is almost 2 times faster. No sense for P4?
2. My previous claim
VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF), but
VVV> Intel limitation for 16-bytes chunk (memory range - +10h)
is wrong, at least for Core2 & Xeon. For these CPUs a "16-byte chunk" means the memory range XXX0 - XXXF. Unfortunately, I can't test AMD.

PS. My testing tool is in the attachment. It starts under MSDOS, switches to 32-bit mode, switches to 64-bit mode and measures rdtsc ticks for the test code.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #29 from hjl dot tools at gmail dot com 2009-05-13 21:44 --- Created an attachment (id=17858) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17858&action=view) Impact of X86_TUNE_FOUR_JUMP_LIMIT on SPEC CPU 2K This is my old data of X86_TUNE_FOUR_JUMP_LIMIT on Penryn and Nehalem. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #28 from vvv at ru dot ru 2009-05-13 19:18 ---
(In reply to comment #24)
> Using padding to avoid 4 branches in 16byte chunk may not be a good idea
> since it will increase code size.

A single one-byte NOP per 16-byte chunk is enough for padding. But, IMHO, four branches in a 16-byte chunk is a very, very infrequent case, especially in 64-bit mode. BTW, it's difficult to understand what Intel means by the term "branch". Is it CALL, JMP, conditional branches, or returns (same as AMD), or only JMP and conditional branches? I believe the last case is right.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #27 from jakub at gcc dot gnu dot org 2009-05-13 19:08 --- If inserting the padding isn't worth it for say core2, m_CORE2 could be dropped from X86_TUNE_FOUR_JUMP_LIMIT, but certainly it would be interesting to see SPEC numbers backing that up. Similarly for AMD CPUs, and if on at least one of these it is beneficial, probably m_GENERIC should keep it, though with far improved min_insn_size so that it doesn't trigger unnecessarily so often. Vlad/Honza, could one of you SPEC test say current 4.4 with one (or both of): http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html on top of it (no need to compare clearly unneeded paddings to no paddings, better compare only somewhat needed paddings against no paddings) compared with X86_TUNE_FOUR_JUMP_LIMIT cleared for the used CPU? -- jakub at gcc dot gnu dot org changed: What|Removed |Added CC||vmakarov at gcc dot gnu dot ||org http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #26 from vvv at ru dot ru 2009-05-13 19:05 ---
(In reply to comment #23)
> Note that we need something that works for the generic model as well, which
> in this case looks like it is the same as for AMD models.

There is a processor property TARGET_FOUR_JUMP_LIMIT; maybe create a new one, TARGET_FIVE_JUMP_LIMIT?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #25 from vvv at ru dot ru 2009-05-13 18:56 ---
(In reply to comment #22)
> CCing H.J for Intel optimization issues.

VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF), but
VVV> Intel limitation for 16-bytes chunk (memory range - +10h)

I have a doubt about this now, thanks to Richard Guenther (comment #20). So I am going to make measurements to check it for Core2.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #24 from hjl dot tools at gmail dot com 2009-05-13 18:45 --- Using padding to avoid 4 branches in 16byte chunk may not be a good idea since it will increase code size. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #23 from rguenth at gcc dot gnu dot org 2009-05-13 18:34 --- Note that we need something that works for the generic model as well, which in this case looks like it is the same as for AMD models. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #22 from ubizjak at gmail dot com 2009-05-13 18:22 --- (In reply to comment #21) > I guess! Your patch is absolutely correct for AMD AthlonTM 64 and AMD > OpteronTM > processors, but it is nonoptimal for Intel processors. Because: ... CCing H.J for Intel optimization issues. -- ubizjak at gmail dot com changed: What|Removed |Added CC||hjl dot tools at gmail dot ||com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 ---
I guess! Your patch is absolutely correct for AMD Athlon(TM) 64 and AMD Opteron(TM) processors, but it is nonoptimal for Intel processors. Because:
1. The AMD limitation is per 16-byte page (memory range XXX0 - XXXF), but the Intel limitation is per 16-byte chunk (memory range x to x+10h).
2. AMD: a maximum of _THREE_ near branches (CALL, JMP, conditional branches, or returns); Intel: a maximum of _FOUR_ branches!

Quotation from the Software Optimization Guide for AMD64 Processors:

6.1 Density of Branches

When possible, align branches such that they do not cross a 16-byte boundary. The AMD Athlon(TM) 64 and AMD Opteron(TM) processors have the capability to cache branch-prediction history for a maximum of three near branches (CALL, JMP, conditional branches, or returns) per 16-byte fetch window. A branch instruction that crosses a 16-byte boundary is counted in the second 16-byte window. Due to architectural restrictions, a branch that is split across a 16-byte boundary cannot dispatch with any other instructions when it is predicted taken. Perform this alignment by rearranging code; it is not beneficial to align branches using padding sequences.

The following branches are limited to three per 16-byte window:
  jcc rel8
  jcc rel32
  jmp rel8
  jmp rel32
  jmp reg
  jmp WORD PTR
  jmp DWORD PTR
  call rel16
  call r/m16
  call rel32
  call r/m32

Coding more than three branches in the same 16-byte code window may lead to conflicts in the branch target buffer. To avoid conflicts in the branch target buffer, space out branches such that three or fewer exist in a given 16-byte code window. For absolute optimal performance, try to limit branches to one per 16-byte code window. Avoid code sequences like the following:

  ALIGN 16
  label3:
        call label1   ; 1st branch in 16-byte code window
        jc   label3   ; 2nd branch in 16-byte code window
        call label2   ; 3rd branch in 16-byte code window
        jnz  label4   ; 4th branch in 16-byte code window
                      ; Cannot be predicted.
If there is a jump table that contains many frequently executed branches, pad the table entries to 8 bytes each to assure that there are never more than three branches per 16-byte block of code. Only branches that have been taken at least once are entered into the dynamic branch prediction, and therefore only those branches count toward the three-branch limit. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #20 from rguenth at gcc dot gnu dot org 2009-05-13 13:31 --- Instruction decoders generally operate on whole cache-lines, so 16-byte chunk very very likely refers to a cache-line. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #19 from vvv at ru dot ru 2009-05-13 11:42 ---
(In reply to comment #18)
> No, .p2align is the right thing to do, given that GCC doesn't have 100%
> accurate information about instruction sizes (for e.g. inline asms it can't
> have, for stuff where branch shortening can decrease the size doesn't have
> it until the shortening branch phase which is too late for this machine
> reorg, and in other cases the lengths are just upper bounds). Say
> .p2align 16,,5 says insert a nop up to 5 bytes if you can reach the 16-byte
> boundary with it, otherwise don't insert anything. But that necessarily
> means that there were less than 11 bytes in the same 16 byte page and if
> the lower bound insn size estimation determined that in 11 bytes you can't
> have 3 branch changing instructions, you are fine.

You are right, if padding is required for every 16-byte page with 4 branches in it. But Intel writes about a "16-byte chunk", not a "16-byte page". Quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual:

  Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put
  more than four branches in a 16-byte chunk.

IMHO, here chunk means a memory range from x to x+10h, where x is _any_ address.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #18 from jakub at gcc dot gnu dot org 2009-05-13 08:30 --- No, .p2align is the right thing to do, given that GCC doesn't have 100% accurate information about instruction sizes (for e.g. inline asms it can't have, for stuff where branch shortening can decrease the size doesn't have it until the shortening branch phase which is too late for this machine reorg, and in other cases the lengths are just upper bounds). Say .p2align 16,,5 says insert a nop up to 5 bytes if you can reach the 16-byte boundary with it, otherwise don't insert anything. But that necessarily means that there were less than 11 bytes in the same 16 byte page and if the lower bound insn size estimation determined that in 11 bytes you can't have 3 branch changing instructions, you are fine. Breaking of fused compare and jump (32-bit code only) is unfortunate, but inserting it before the cmp would mean often unnecessarily large padding. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #17 from vvv at ru dot ru 2009-05-12 16:40 ---
(In reply to comment #16)
> Created an attachment (id=17783)
> --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
> gcc45-pr39942.patch
> Patch that attempts to take into account .p2align directives that are
> emitted for (some) CODE_LABELs and also the gen_align insns that the pass
> itself inserts. For a CODE_LABEL, say .p2align 16,,10 means either that the
> .p2align directive starts a new 16 byte page (then insns before it are
> never interesting), or nothing was skipped because more than 10 bytes would
> need to be skipped. But that means the current group could contain only 5
> or less bytes of instructions before the label, so again, we don't have to
> look at instructions not in the last 5 bytes.
> Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't
> emit the second .p2align 3, which might (and often does) skip more than
> MAX_SKIP bytes (up to 7).

Nice patch. The code looks better. It was checked on Linux kernel 2.6.29.2. But two notes:

1. There is no guarantee that .p2align will be translated into NOPs.
Example:
# cat test.c
void f(int i)
{
    if (i == 1) F(1);
    if (i == 2) F(2);
    if (i == 3) F(3);
    if (i == 4) F(4);
    if (i == 5) F(5);
}
# gcc -o test.s test.c -O2 -S
# cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
        .globl  f
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
        cmpl    $1, %edi
        je      .L7
        cmpl    $2, %edi
        je      .L7
        cmpl    $3, %edi
        je      .L7
        cmpl    $4, %edi
        .p2align 4,,5           # <--- attempt at padding
        je      .L7
        cmpl    $5, %edi
        je      .L7
        rep ret
        .p2align 4,,10
        .p2align 3
.L7:
        xorl    %eax, %eax
        jmp     F
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .ident  "GCC: (GNU) 4.5.0 20090512 (experimental)"
        .section        .note.GNU-stack,"",@progbits
# gcc -o test.out test.s -O2 -c
# objdump -d test.out

<f>:
   0:	83 ff 01             	cmp    $0x1,%edi
   3:	74 1b                	je     20
   5:	83 ff 02             	cmp    $0x2,%edi
   8:	74 16                	je     20
   a:	83 ff 03             	cmp    $0x3,%edi
   d:	74 11                	je     20
   f:	83 ff 04             	cmp    $0x4,%edi
  12:	74 0c                	je     20     <--- no NOP here
  14:	83 ff 05             	cmp    $0x5,%edi
  17:	74 07                	je     20
  19:	f3 c3                	repz retq

IMHO, it's better to insert not .p2align, but NOPs directly. (I mean the line: emit_insn_before (gen_align (GEN_INT (padsize)), insn); )

2. IMHO, it's a bad idea to insert something between CMP and a conditional jump. Quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual:

>> 3.4.2.2 Optimizing for Macro-fusion
>> Macro-fusion merges two instructions to a single μop. Intel Core
>> Microarchitecture performs this hardware optimization under limited
>> circumstances.
>> The first instruction of the macro-fused pair must be a CMP or TEST
>> instruction. This instruction can be REG-REG, REG-IMM, or a micro-fused
>> REG-MEM comparison. The second instruction (adjacent in the instruction
>> stream) should be a conditional branch.

So if we need to insert NOPs, it's better to do it _before_ the CMP.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #16 from jakub at gcc dot gnu dot org 2009-04-30 09:07 ---
Created an attachment (id=17783)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
gcc45-pr39942.patch

Patch that attempts to take into account .p2align directives that are emitted for (some) CODE_LABELs and also the gen_align insns that the pass itself inserts. For a CODE_LABEL, say .p2align 16,,10 means either that the .p2align directive starts a new 16-byte page (then insns before it are never interesting), or nothing was skipped because more than 10 bytes would need to be skipped. But that means the current group could contain only 5 or fewer bytes of instructions before the label, so again, we don't have to look at instructions not in the last 5 bytes.

Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit the second .p2align 3, which might (and often does) skip more than MAX_SKIP bytes (up to 7).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
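[Editorial note] The modular arithmetic behind both claims in this comment can be sanity-checked with a small model of the GAS `.p2align` directive semantics (align to 2^power, but only if at most max_skip bytes need to be skipped). This is an illustrative sketch, not GCC or binutils code; the `p2align` helper name is invented for the example:

```python
# Sketch: model ".p2align power,,max_skip" -- padding is emitted only when
# the skip required to reach the 2^power boundary is <= max_skip.
def p2align(offset, power, max_skip):
    align = 1 << power
    skip = (align - offset) % align
    return offset + skip if skip <= max_skip else offset

# Claim 1: if ".p2align 4,,10" skipped nothing (and the label was not already
# aligned), at most 5 bytes of the current 16-byte page precede the label.
unpadded = [off for off in range(16)
            if p2align(off, 4, 10) == off and off % 16 != 0]
assert unpadded == [1, 2, 3, 4, 5]

# Claim 2: a following ".p2align 3" always pads to an 8-byte boundary and can
# skip up to 7 bytes -- more than a MAX_SKIP smaller than 7 would permit.
assert max(p2align(off, 3, 7) - off for off in range(16)) == 7
```

Under this model, dropping the second `.p2align 3` when MAX_SKIP < 7 is exactly what prevents the directive pair from skipping more bytes than the caller asked for.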
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #15 from vvv at ru dot ru 2009-04-29 19:16 ---
One more example: a 5-byte NOP between leaveq and retq.

# cat test.c
void wait_for_enter()
{
        int u = getchar();
        while (!u) u = getchar()-13;
}
main()
{
        wait_for_enter();
}
# gcc -o t.out test.c -O2 -march=core2 -fno-omit-frame-pointer
# objdump -d t.out
...
00400540 <wait_for_enter>:
  400540:       55                      push   %rbp
  400541:       31 c0                   xor    %eax,%eax
  400543:       48 89 e5                mov    %rsp,%rbp
  400546:       e8 f5 fe ff ff          callq  400440
  40054b:       85 c0                   test   %eax,%eax
  40054d:       75 13                   jne    400562
  40054f:       90                      nop
  400550:       31 c0                   xor    %eax,%eax
  400552:       e8 e9 fe ff ff          callq  400440
  400557:       83 f8 0d                cmp    $0xd,%eax
  40055a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400560:       74 ee                   je     400550
  400562:       c9                      leaveq
  400563:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)   <-- NONOPTIMAL!
  400568:       c3                      retq
  400569:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

00400570 <main>:
  400570:       55                      push   %rbp
  400571:       31 c0                   xor    %eax,%eax
  400573:       48 89 e5                mov    %rsp,%rbp
  400576:       e8 c5 ff ff ff          callq  400540 <wait_for_enter>
  40057b:       c9                      leaveq
  40057c:       c3                      retq
  40057d:       90                      nop
  40057e:       90                      nop
  40057f:       90                      nop

So the bug is unresolved.

--
vvv at ru dot ru changed:

           What    |Removed     |Added
------------------------------------------------
             Status|RESOLVED    |UNCONFIRMED
         Resolution|INVALID     |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #14 from jakub at gcc dot gnu dot org 2009-04-29 10:12 ---
Also, couldn't we use the information computed by compute_alignments and assume CODE_LABELs are aligned? We would probably need to add a label_to_max_skip (rtx) function to final.c, so that not just label_to_alignment, but also LABEL_TO_MAX_SKIP is available to backends. Then, when we know for the label in the testcase that
        .p2align 4,,10
        .p2align 3
we know the 16-byte boundary is either at that label, or at most 5 bytes before it, so all we need to consider is any jumps/calls in the last 5 bytes before the label.

For min_insn_size, is it possible to find out for which non-jump/call insns get_attr_length might not be exact (i.e. be a maximum guess rather than a guaranteed size (though of course, this is also just an optimization, so 100% guarantees aren't needed either))? If so, we could use get_attr_length for the insns where it is known to be exact...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #13 from jakub at gcc dot gnu dot org 2009-04-29 09:32 ---
You are benchmarking something completely unrelated. What really matters is how code that has 4 branches/calls in one 16-byte block is able to predict all those branches. And Core2, similarly to various AMD CPUs, is not able to predict them well. In the #c6 testcase it considers the je, call, jne and ret and whether they can be in a 16-byte block or not. They can't: je is 2 bytes, call 5 bytes, leal 4 bytes (but gcc uses min_insn_size, which is 2 in this case), testl 2, jne 2, addq 4 (but again, min_insn_size is 2 in this case).

min_insn_size seems to be very conservative. I guess teaching it about a bunch of prefixes couldn't hurt; for non-jump/call insns ATM it estimates just the displacement size and doesn't consider any prefixes (even those that really can't change after machine reorg), etc.

--
jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu dot
                   |                            |org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #12 from vvv at ru dot ru 2009-04-29 07:55 ---
(In reply to comment #9)
> So that explains it, Use -Os or attribute cold if you want NOPs to be gone.

But my measurements on a Core 2 Duo P8600 show that

        push %ebp
        mov  %esp,%ebp
        leave
        ret

is _faster_ than

        push %ebp
        mov  %esp,%ebp
        leave
        xchg %ax,%ax
        ret

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #11 from vvv at ru dot ru 2009-04-29 07:46 ---
(In reply to comment #8)
> From config/i386/i386.c:
> /* AMD Athlon works faster
>    when RET is not destination of conditional jump or directly preceded
>    by other jump instruction. We avoid the penalty by inserting NOP just
>    before the RET instructions in such cases. */
> static void
> ix86_pad_returns (void)
> ...

But I am using a Core 2 Duo. Why do we see a multi-byte NOP, not a single-byte NOP? And why, if we change the line u = F(u)*3+1; to u = F(u)*4+1; or u = F(u);, does the number of NOPs change?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #10 from ubizjak at gmail dot com 2009-04-28 21:53 ---
Actually, the alignment is from ix86_avoid_jump_misspredicts, where:

  /* Look for all minimal intervals of instructions containing 4 jumps. The
     intervals are bounded by START and INSN. NBYTES is the total size of
     instructions in the interval including INSN and not including START.
     When the NBYTES is smaller than 16 bytes, it is possible that the end
     of START and INSN ends up in the same 16byte page.

     The smallest offset in the page INSN can start is the case where START
     ends on the offset 0. Offset of INSN is then NBYTES - sizeof (INSN).
     We add p2align to 16byte window with maxskip 17 - NBYTES + sizeof (INSN). */

So, this is by design. Use -Os if code size is important.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
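[Editorial note] The quoted interval scan can be illustrated with a toy model. This is a sketch under my own simplifications, not the actual ix86_avoid_jump_misspredicts code (which works on RTL and real insn lengths); the `find_padding` name and the (size, is_branch) instruction encoding are invented for the example:

```python
# Sketch: find places where 4 branches could share one 16-byte page.
# Instructions are (size_in_bytes, is_branch) pairs. For each branch, walk
# back over the minimal interval containing 4 branches; if that interval is
# under 16 bytes, record a padding request using the quoted max-skip formula.
def find_padding(insns):
    pads = []  # (index_of_4th_branch, max_skip)
    for i, (size, is_branch) in enumerate(insns):
        if not is_branch:
            continue
        nbytes, branches = size, 1
        j = i
        while j > 0 and branches < 4:
            j -= 1
            nbytes += insns[j][0]
            branches += insns[j][1]
        if branches == 4 and nbytes < 16:
            # quoted formula: maxskip = 17 - NBYTES + sizeof (INSN)
            pads.append((i, 17 - nbytes + size))
    return pads

# Shape of the #c6 testcase: je(2), call(5), filler(2), jne(2), ret(1)
# -> 4 branches in 12 bytes, so the ret gets padded.  -> [(4, 6)]
print(find_padding([(2, True), (5, True), (2, False), (2, True), (1, True)]))
```

Spreading the branches out (e.g. a 10-byte filler between them) makes every 4-branch interval exceed 16 bytes, and no padding is requested, which matches the "by design" explanation above.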
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #9 from pinskia at gcc dot gnu dot org 2009-04-28 21:52 ---
So that explains it. Use -Os or attribute cold if you want the NOPs to be gone.

--
pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|                            |INVALID

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #8 from ubizjak at gmail dot com 2009-04-28 21:47 ---
From config/i386/i386.c:

/* AMD Athlon works faster
   when RET is not destination of conditional jump or directly preceded
   by other jump instruction. We avoid the penalty by inserting NOP just
   before the RET instructions in such cases. */
static void
ix86_pad_returns (void)
...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
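[Editorial note] The behavior that quoted comment describes can be sketched in a few lines. The `pad_returns` helper below is hypothetical and works on mnemonic strings purely for illustration; the real GCC pass operates on RTL insns:

```python
# Sketch of the described rule: insert a NOP before a RET that either
# directly follows a jump instruction or is itself a jump target, to avoid
# the AMD Athlon branch-prediction penalty.
def pad_returns(insns, jump_targets):
    out = []
    for i, insn in enumerate(insns):
        preceded_by_jump = bool(out) and out[-1].startswith("j")
        if insn == "ret" and (i in jump_targets or preceded_by_jump):
            out.append("nop")
        out.append(insn)
    return out

# ret right after a conditional jump gets a NOP in front of it:
print(pad_returns(["mov", "je", "ret"], set()))   # -> ['mov', 'je', 'nop', 'ret']
# ret that is a jump target (index 1 here) also gets one:
print(pad_returns(["mov", "ret"], {1}))           # -> ['mov', 'nop', 'ret']
```

This also shows why the question in comment #11 arises: the rule is keyed to Athlon, yet the padding is emitted when tuning for other CPUs too.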
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #6 from vvv at ru dot ru 2009-04-28 21:18 ---
Let's compile the file test.c:

// #file test.c
extern int F(int m);
void func(int x)
{
        int u = F(x);
        while (u) u = F(u)*3+1;
}

# gcc -o t.out test.c -c -O2
# objdump -d t.out

t.out:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <func>:
   0:   48 83 ec 08             sub    $0x8,%rsp
   4:   e8 00 00 00 00          callq  9
   9:   85 c0                   test   %eax,%eax
   b:   89 c7                   mov    %eax,%edi
   d:   74 0e                   je     1d
   f:   90                      nop
  10:   e8 00 00 00 00          callq  15
  15:   8d 7c 40 01             lea    0x1(%rax,%rax,2),%edi
  19:   85 ff                   test   %edi,%edi
  1b:   75 f3                   jne    10
  1d:   48 83 c4 08             add    $0x8,%rsp
  21:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)   <-- nonoptimal
  28:   c3                      retq

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #7 from pinskia at gcc dot gnu dot org 2009-04-28 21:23 ---
4.1.2 produces:
.L4:
        addq    $8, %rsp
        .p2align 4,,2
        ret

While the trunk produces:
.L1:
        addq    $8, %rsp
        .p2align 4,,2
        .p2align 3
        ret

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #5 from ubizjak at gmail dot com 2009-04-28 17:37 --- Unfortunately, all code snippets and dumps are of no use. Please see http://gcc.gnu.org/bugs.html for the reason why. As an exercise, please compile *standalone* _preprocessed_ source you will create with -S added to your compile flags and count the number of .p2align directives in the code stream. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #4 from vvv at ru dot ru 2009-04-28 17:15 ---
Created an attachment (id=1)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view)
Simple example from Linux

See the two functions:
static void pre_schedule_rt
static void switched_from_rt

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #3 from vvv at ru dot ru 2009-04-28 17:10 ---
Additional examples from Linux Kernel 2.6.29.1:
(Note the conditional statement at the end of all functions!)

=
linux/drivers/video/console/bitblit.c

void fbcon_set_bitops(struct fbcon_ops *ops)
{
        ops->bmove = bit_bmove;
        ops->clear = bit_clear;
        ops->putcs = bit_putcs;
        ops->clear_margins = bit_clear_margins;
        ops->cursor = bit_cursor;
        ops->update_start = bit_update_start;
        ops->rotate_font = NULL;

        if (ops->rotate)
                fbcon_set_rotate(ops);
}

8020a5e0 <fbcon_set_bitops>:
8020a5e0:       55                      push   %rbp
8020a5e1:       bf 01 00 00 00          mov    $0x1,%edi
8020a5e6:       48 89 e5                mov    %rsp,%rbp
8020a5e9:       e8 c2 fd 35 00          callq  8056a3b0
8020a5ee:       65 48 8b 04 25 10 00    mov    %gs:0x10,%rax
8020a5f5:       00 00
8020a5f7:       48 2d c8 1f 00 00       sub    $0x1fc8,%rax
8020a5fd:       f0 0f ba 28 10          lock btsl $0x10,(%rax)
8020a602:       19 d2                   sbb    %edx,%edx
8020a604:       85 d2                   test   %edx,%edx
8020a606:       75 0a                   jne    8020a612
8020a608:       0f 20 e0                mov    %cr4,%rax
8020a60b:       48 83 c8 04             or     $0x4,%rax
8020a60f:       0f 22 e0                mov    %rax,%cr4
8020a612:       bf 01 00 00 00          mov    $0x1,%edi
8020a617:       e8 e4 fc 35 00          callq  8056a300
8020a61c:       65 48 8b 04 25 10 00    mov    %gs:0x10,%rax
8020a623:       00 00
8020a625:       f6 80 38 e0 ff ff 08    testb  $0x8,-0x1fc8(%rax)
8020a62c:       75 02                   jne    8020a630
8020a62e:       c9                      leaveq
8020a62f:       c3                      retq
8020a630:       e8 2b 99 35 00          callq  80563f60
8020a635:       c9                      leaveq
8020a636:       66 90                   xchg   %ax,%ax
8020a638:       c3                      retq

==
/arch/x86/kernel/io_delay.c

void native_io_delay(void)
{
        switch (io_delay_type) {
        default:
        case CONFIG_IO_DELAY_TYPE_0X80:
                asm volatile ("outb %al, $0x80");
                break;
        case CONFIG_IO_DELAY_TYPE_0XED:
                asm volatile ("outb %al, $0xed");
                break;
        case CONFIG_IO_DELAY_TYPE_UDELAY:
                /*
                 * 2 usecs is an upper-bound for the outb delay but
                 * note that udelay doesn't have the bus-level
                 * side-effects that outb does, nor does udelay() have
                 * precise timings during very early bootup (the delays
                 * are shorter until calibrated):
                 */
                udelay(2);
        case CONFIG_IO_DELAY_TYPE_NONE:
                break;
        }
}
EXPORT_SYMBOL(native_io_delay);

802131e0 <native_io_delay>:
802131e0:       55                      push   %rbp
802131e1:       8b 05 3d b3 54 00       mov    0x54b33d(%rip),%eax   # 8075e524
802131e7:       48 89 e5                mov    %rsp,%rbp
802131ea:       83 f8 02                cmp    $0x2,%eax
802131ed:       74 29                   je     80213218
802131ef:       83 f8 03                cmp    $0x3,%eax
802131f2:       74 06                   je     802131fa
802131f4:       ff c8                   dec    %eax
802131f6:       74 10                   je     80213208
802131f8:       e6 80                   out    %al,$0x80
802131fa:       c9                      leaveq
802131fb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
80213200:       c3                      retq
80213201:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
80213208:       e6 ed                   out    %al,$0xed
8021320a:       c9                      leaveq
8021320b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
80213210:       c3                      retq
80213211:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
80213218:       bf 8e 21 00 00          mov    $0x218e,%edi
8021321d:       0f 1f 00                nopl   (%rax)
80213220:       e8 fb ac 1e 00          callq  803fdf20 <__const_udelay>
80213225:       c9                      leaveq
80213226:       66 90                   xchg   %ax,%ax
80213228:       c3                      retq

===
arch/x86/mm/ioremap.c

int ioremap_change_attr(unsigned long vaddr, unsigned long size, unsig
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #2 from vvv at ru dot ru 2009-04-28 17:04 ---
Created an attachment (id=17776)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux Kernel 2.6.29.1

See static void set_blitting_type

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
--- Comment #1 from pinskia at gcc dot gnu dot org 2009-04-28 13:42 ---
Can you provide the preprocessed source which contains set_blitting_type?

--
pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
          Component|c                           |target
 GCC target triplet|                            |x86_64-linux-gnu
           Keywords|                            |missed-optimization

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942