[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2020-04-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #53 fr

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-21 Thread jakub at gcc dot gnu dot org
--- Comment #52 from jakub at gcc dot gnu dot org 2009-05-21 13:26 --- Subject: Bug 39942 Author: jakub Date: Thu May 21 13:26:13 2009 New Revision: 147766 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147766 Log: PR target/39942 * config/i386/x86-64.h (ASM_OUTP

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-21 Thread jakub at gcc dot gnu dot org
--- Comment #51 from jakub at gcc dot gnu dot org 2009-05-21 13:21 --- Subject: Bug 39942 Author: jakub Date: Thu May 21 13:21:30 2009 New Revision: 147765 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147765 Log: PR target/39942 * config/i386/x86-64.h (ASM_OUTP

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-20 Thread jakub at gcc dot gnu dot org
--- Comment #50 from jakub at gcc dot gnu dot org 2009-05-20 22:09 --- nopl 0x0(%rax,%rax,1) and nopw 0x0(%rax,%rax,1) aren't padding (though, it has been added in this case for label alignment or function entry alignment, not to avoid 4+ jumps in one 16byte page)? Anyway, you want t

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-20 Thread vvv at ru dot ru
--- Comment #49 from vvv at ru dot ru 2009-05-20 21:38 --- (In reply to comment #48) How do these patches work? Are any special options required? # /media/disk-1/B/bin/gcc --version gcc (GCC) 4.5.0 20090520 (experimental) # cat test.c void f(int i) { if (i == 1) F(1); if

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-18 Thread hjl at gcc dot gnu dot org
--- Comment #48 from hjl at gcc dot gnu dot org 2009-05-18 17:21 --- Subject: Bug 39942 Author: hjl Date: Mon May 18 17:21:13 2009 New Revision: 147671 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147671 Log: 2009-05-18 H.J. Lu PR target/39942 * config/i386

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-16 Thread jakub at gcc dot gnu dot org
--- Comment #47 from jakub at gcc dot gnu dot org 2009-05-16 07:12 --- Subject: Bug 39942 Author: jakub Date: Sat May 16 07:12:02 2009 New Revision: 147607 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147607 Log: PR target/39942 * final.c (label_to_max_skip): N

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-16 Thread jakub at gcc dot gnu dot org
--- Comment #46 from jakub at gcc dot gnu dot org 2009-05-16 07:10 --- Subject: Bug 39942 Author: jakub Date: Sat May 16 07:09:52 2009 New Revision: 147606 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147606 Log: PR target/39942 * config/i386/x86-64.h (ASM_OUTP

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org
--- Comment #45 from jakub at gcc dot gnu dot org 2009-05-16 06:37 --- cmpl $1, %eax does have the modrm byte: 83 f8 01 cmp $0x1,%eax compared to cmpl $0xdeadbeef, %eax which doesn't have it: 3d ef be ad de cmp $0xdeadbeef,%eax So I think what I wrote is more prec
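
The two encodings quoted in comment #45 can be checked by hand: in `83 f8 01` the second byte is a ModRM byte, while `3d ef be ad de` is the short accumulator form in which the opcode is followed directly by the 32-bit immediate. Below is a minimal sketch that splits a ModRM byte into its three fields. The names `struct modrm` and `split_modrm` are made up for illustration and are not part of GCC.

```c
#include <assert.h>

/* Illustrative helper, not GCC code: the three bit fields of an x86 ModRM byte. */
struct modrm {
    unsigned mod; /* bits 7..6: addressing mode, 3 means register operand */
    unsigned reg; /* bits 5..3: register number or opcode extension (/digit) */
    unsigned rm;  /* bits 2..0: register or memory operand selector */
};

struct modrm split_modrm(unsigned byte)
{
    struct modrm m;

    assert(byte <= 0xff); /* a ModRM field is exactly one byte */
    m.mod = (byte >> 6) & 0x3;
    m.reg = (byte >> 3) & 0x7;
    m.rm  = byte & 0x7;
    return m;
}
```

For the byte `0xf8` this gives mod = 3 (register operand), reg = 7 (the `/7` opcode extension that selects CMP for opcode `83`) and rm = 0 (`%eax`), so the 3-byte form carries one ModRM byte plus an 8-bit immediate.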

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread hjl dot tools at gmail dot com
--- Comment #44 from hjl dot tools at gmail dot com 2009-05-15 23:05 --- (In reply to comment #41) > The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me > look at one of the cases which was wrong and the problem is that cmp $0x1d, > %al > has too large get_at

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org
--- Comment #43 from jakub at gcc dot gnu dot org 2009-05-15 18:23 --- Some code size growth is from enlarged get_attr_modrm though, 292 bytes for 64-bit, 1338 bytes for 32-bit. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org
--- Comment #42 from jakub at gcc dot gnu dot org 2009-05-15 18:18 --- Sizes with the #c41 patch together with the 3 patches mentioned in #c31 are: 0x890038 (64-bit) and 0x8ce08c (32-bit), 44 bad 16-byte pages in 64-bit, 35 in 32-bit. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=3

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org
--- Comment #41 from jakub at gcc dot gnu dot org 2009-05-15 16:24 --- The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me look at one of the cases which was wrong and the problem is that cmp $0x1d, %al has too large get_attr_length (insn) returned, 3 instead o

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread hjl dot tools at gmail dot com
--- Comment #40 from hjl dot tools at gmail dot com 2009-05-15 14:35 --- (In reply to comment #37) > This patch looks very wrong. It assumes that min_insn_size gives exact insn > sizes (current min_insn_size is very far from that, but even get_attr_length > isn't exact), doesn't take i

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org
--- Comment #39 from jakub at gcc dot gnu dot org 2009-05-15 12:12 --- Created an attachment (id=17874) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17874&action=view) test4jmp.sh -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org
--- Comment #38 from jakub at gcc dot gnu dot org 2009-05-15 12:11 --- To extend #c31, I've also built the same tree with another patch which made sure ix86_avoid_jump_mispredicts is never called (change "&& optimize" into "&& optimize > 4" in ix86_reorg). cc1plus sizes were then 0x88d6

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org
--- Comment #37 from jakub at gcc dot gnu dot org 2009-05-15 07:56 --- This patch looks very wrong. It assumes that min_insn_size gives exact insn sizes (current min_insn_size is very far from that, but even get_attr_length isn't exact), doesn't take into account label alignments nor br

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com
--- Comment #36 from hjl dot tools at gmail dot com 2009-05-15 04:32 --- Created an attachment (id=17871) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17871&action=view) An updated patch A few comments: 1. 3 branch limit is per 16byte page, not 16byte window. 2. We should allow
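
The distinction drawn in comment #36 between a 16-byte page (an aligned block, addresses XXX0 to XXXF) and a 16-byte window (any 16 consecutive bytes) can be made concrete with a small sketch. It assumes each branch is represented by a single byte address and that the addresses are sorted in ascending order; both function names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of branches that share one aligned 16-byte page.
   Assumption: addrs[] is sorted ascending, one address per branch. */
size_t max_branches_per_page(const unsigned long *addrs, size_t n)
{
    size_t best = 0, run = 0;

    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || addrs[i - 1] <= addrs[i]); /* input must be sorted */
        if (i > 0 && (addrs[i] >> 4) == (addrs[i - 1] >> 4))
            run++;      /* same page as the previous branch */
        else
            run = 1;    /* first branch seen in a new page */
        if (run > best)
            best = run;
    }
    return best;
}

/* Largest number of branches inside any sliding window [a, a + 16).
   Same input assumptions as above. */
size_t max_branches_per_window(const unsigned long *addrs, size_t n)
{
    size_t best = 0, lo = 0;

    for (size_t hi = 0; hi < n; hi++) {
        while (addrs[hi] - addrs[lo] >= 16)
            lo++;       /* drop branches that fell out of the window */
        if (hi - lo + 1 > best)
            best = hi - lo + 1;
    }
    return best;
}
```

With branches at 0x0c, 0x0e, 0x10 and 0x12 the window count is 4 but the page count is only 2, because the branches straddle the boundary at 0x10. Under the per-page reading such code needs no padding, while under the per-window reading it does.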

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com
--- Comment #35 from hjl dot tools at gmail dot com 2009-05-15 02:23 --- Created an attachment (id=17870) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17870&action=view) A patch This patch limits 3 branches per 16byte page. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=399

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread vvv at ru dot ru
--- Comment #34 from vvv at ru dot ru 2009-05-14 19:43 --- (In reply to comment #32) > Please make sure that you only test nop paddings for branch insns, > not nop paddings for branch targets, which prefer 16byte alignment. Additional tests (for Core2) results: 1. Execution time don't d

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com
--- Comment #33 from hjl dot tools at gmail dot com 2009-05-14 18:37 --- (In reply to comment #20) > Instruction decoders generally operate on whole cache-lines, so 16-byte chunk > very very likely refers to a cache-line. > That is true. For Intel CPUs, "16-bytes chunk" means memory r

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com
--- Comment #32 from hjl dot tools at gmail dot com 2009-05-14 15:58 --- (In reply to comment #30) > Created an attachment (id=17863) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view) [edit] > Testing tool. > Please make sure that you only test nop paddings for br

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread jakub at gcc dot gnu dot org
--- Comment #31 from jakub at gcc dot gnu dot org 2009-05-14 15:15 --- Some -O2 code size data from today's trunk bootstraps. The first .text line is always vanilla bootstrap, the second one with http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html only, the third one with that patch

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread vvv at ru dot ru
--- Comment #30 from vvv at ru dot ru 2009-05-14 09:01 --- Created an attachment (id=17863) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view) Testing tool. Here is results of my testing. Code: align 128 test_cikl: rept 14 ; 14 if SH=0, 15 if SH=1,

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread hjl dot tools at gmail dot com
--- Comment #29 from hjl dot tools at gmail dot com 2009-05-13 21:44 --- Created an attachment (id=17858) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17858&action=view) Impact of X86_TUNE_FOUR_JUMP_LIMIT on SPEC CPU 2K This is my old data of X86_TUNE_FOUR_JUMP_LIMIT on Penryn a

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru
--- Comment #28 from vvv at ru dot ru 2009-05-13 19:18 --- (In reply to comment #24) > Using padding to avoid 4 branches in 16byte chunk may not be a good idea since > it will increase code size. A single one-byte NOP per 16-byte chunk is enough for padding. But, IMHO, four branches in 16

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread jakub at gcc dot gnu dot org
--- Comment #27 from jakub at gcc dot gnu dot org 2009-05-13 19:08 --- If inserting the padding isn't worth it for say core2, m_CORE2 could be dropped from X86_TUNE_FOUR_JUMP_LIMIT, but certainly it would be interesting to see SPEC numbers backing that up. Similarly for AMD CPUs, and if

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru
--- Comment #26 from vvv at ru dot ru 2009-05-13 19:05 --- (In reply to comment #23) > Note that we need something that works for the generic model as well, which in > this case looks like it is the same as for AMD models. There is processor property TARGET_FOUR_JUMP_LIMIT, may be creat

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru
--- Comment #25 from vvv at ru dot ru 2009-05-13 18:56 --- (In reply to comment #22) > CCing H.J for Intel optimization issues. VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF), but VVV> Intel limitation for 16-bytes chunk (memory range - +10

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread hjl dot tools at gmail dot com
--- Comment #24 from hjl dot tools at gmail dot com 2009-05-13 18:45 --- Using padding to avoid 4 branches in 16byte chunk may not be a good idea since it will increase code size. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread rguenth at gcc dot gnu dot org
--- Comment #23 from rguenth at gcc dot gnu dot org 2009-05-13 18:34 --- Note that we need something that works for the generic model as well, which in this case looks like it is the same as for AMD models. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread ubizjak at gmail dot com
--- Comment #22 from ubizjak at gmail dot com 2009-05-13 18:22 --- (In reply to comment #21) > I guess! Your patch is absolutely correct for AMD Athlon(TM) 64 and AMD > Opteron(TM) > processors, but it is nonoptimal for Intel processors. Because: ... CCing H.J for Intel optimization issue

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru
--- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 --- I guess! Your patch is absolutely correct for AMD Athlon(TM) 64 and AMD Opteron(TM) processors, but it is nonoptimal for Intel processors. Because: 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF), but Intel li

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread rguenth at gcc dot gnu dot org
--- Comment #20 from rguenth at gcc dot gnu dot org 2009-05-13 13:31 --- Instruction decoders generally operate on whole cache-lines, so 16-byte chunk very very likely refers to a cache-line. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru
--- Comment #19 from vvv at ru dot ru 2009-05-13 11:42 --- (In reply to comment #18) > No, .p2align is the right thing to do, given that GCC doesn't have 100% > accurate information about instruction sizes (for e.g. inline asms it can't > have, for > stuff where branch shortening can dec

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread jakub at gcc dot gnu dot org
--- Comment #18 from jakub at gcc dot gnu dot org 2009-05-13 08:30 --- No, .p2align is the right thing to do, given that GCC doesn't have 100% accurate information about instruction sizes (for e.g. inline asms it can't have, for stuff where branch shortening can decrease the size doesn't

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-12 Thread vvv at ru dot ru
--- Comment #17 from vvv at ru dot ru 2009-05-12 16:40 --- (In reply to comment #16) > Created an attachment (id=17783) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view) [edit] > gcc45-pr39942.patch > Patch that attempts to take into account .p2align directives that

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-30 Thread jakub at gcc dot gnu dot org
--- Comment #16 from jakub at gcc dot gnu dot org 2009-04-30 09:07 --- Created an attachment (id=17783) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view) gcc45-pr39942.patch Patch that attempts to take into account .p2align directives that are emitted for (some) COD

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru
--- Comment #15 from vvv at ru dot ru 2009-04-29 19:16 --- One more example: a 5-byte nop between leaveq and retq. # cat test.c void wait_for_enter() { int u = getchar(); while (!u) u = getchar()-13; } main() { wait_for_enter(); } # gcc -o t.out test.c -O

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread jakub at gcc dot gnu dot org
--- Comment #14 from jakub at gcc dot gnu dot org 2009-04-29 10:12 --- Also, couldn't we use the information computed by compute_alignments and assume CODE_LABELs are aligned? Probably would need to add label_to_max_skip (rtx) function to final.c, so that not just label_to_alignment, but

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread jakub at gcc dot gnu dot org
--- Comment #13 from jakub at gcc dot gnu dot org 2009-04-29 09:32 --- You are benchmarking something completely unrelated. What really matters is how code that has 4 branches/calls in one 16-byte block is able to predict all those branches. And Core2 similarly to various AMD CPUs is no

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru
--- Comment #12 from vvv at ru dot ru 2009-04-29 07:55 --- (In reply to comment #9) > So that explains it, Use -Os or attribute cold if you want NOPs to be gone. But my measurements on Core 2 Duo P8600 show that push %ebp mov %esp,%ebp leave ret is _faster_ than push %ebp mov %esp,%eb

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru
--- Comment #11 from vvv at ru dot ru 2009-04-29 07:46 --- (In reply to comment #8) > From config/i386/i386.c: > /* AMD Athlon works faster >when RET is not destination of conditional jump or directly preceded >by other jump instruction. We avoid the penalty by inserting NOP jus

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread ubizjak at gmail dot com
--- Comment #10 from ubizjak at gmail dot com 2009-04-28 21:53 --- Actually, alignment is from ix86_avoid_jump_misspredicts, where: /* Look for all minimal intervals of instructions containing 4 jumps. The intervals are bounded by START and INSN. NBYTES is the total size of
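
The idea in the quoted source comment, scanning for minimal intervals of instructions that contain four jumps and checking how many bytes each interval spans, can be sketched as follows. This is a simplified model and not the GCC implementation: `struct insn`, `first_dense_jump_interval` and the exact `<= 16` threshold are assumptions made for illustration, and instruction sizes are taken as exact, which (as later comments in the thread point out) GCC cannot guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction record: encoded size in bytes and a jump flag. */
struct insn {
    unsigned size;
    int is_jump;
};

/* Return the index of the first jump that closes a minimal interval of four
   jumps spanning at most 16 bytes, or n if no such interval exists.
   Assumption: sizes are exact and alignment padding is ignored. */
size_t first_dense_jump_interval(const struct insn *code, size_t n)
{
    size_t start = 0;     /* first instruction of the current interval */
    unsigned nbytes = 0;  /* total size of code[start..i] */
    unsigned njumps = 0;  /* number of jumps in code[start..i] */

    for (size_t i = 0; i < n; i++) {
        assert(code[i].size > 0); /* every instruction occupies space */
        nbytes += code[i].size;
        if (code[i].is_jump)
            njumps++;

        /* Keep at most four jumps by dropping instructions from the left. */
        while (njumps > 4) {
            nbytes -= code[start].size;
            if (code[start].is_jump)
                njumps--;
            start++;
        }
        /* Make the interval minimal: it should begin at a jump. */
        while (start < i && !code[start].is_jump) {
            nbytes -= code[start].size;
            start++;
        }
        if (njumps == 4 && code[i].is_jump && nbytes <= 16)
            return i;
    }
    return n;
}
```

Four back-to-back 2-byte jumps are reported at index 3, whereas the same jumps separated by 6-byte non-jump instructions span 26 bytes and are not reported. A real pass would insert padding before the reported jump instead of only returning its position.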

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread pinskia at gcc dot gnu dot org
--- Comment #9 from pinskia at gcc dot gnu dot org 2009-04-28 21:52 --- So that explains it, Use -Os or attribute cold if you want NOPs to be gone. -- pinskia at gcc dot gnu dot org changed: What|Removed |Added -

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread ubizjak at gmail dot com
--- Comment #8 from ubizjak at gmail dot com 2009-04-28 21:47 --- From config/i386/i386.c: /* AMD Athlon works faster when RET is not destination of conditional jump or directly preceded by other jump instruction. We avoid the penalty by inserting NOP just before the RET inst

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru
--- Comment #6 from vvv at ru dot ru 2009-04-28 21:18 --- Let's compile file test.c //#file test.c extern int F(int m); void func(int x) { int u = F(x); while (u) u = F(u)*3+1; } # gcc -o t.out test.c -c -O2 # objdump -d t.out t.out: file format e

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread pinskia at gcc dot gnu dot org
--- Comment #7 from pinskia at gcc dot gnu dot org 2009-04-28 21:23 --- 4.1.2 produces: .L4: addq $8, %rsp .p2align 4,,2 ret While the trunk produces: .L1: addq $8, %rsp .p2align 4,,2 .p2align 3 ret -- http://gcc.gnu.org

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread ubizjak at gmail dot com
--- Comment #5 from ubizjak at gmail dot com 2009-04-28 17:37 --- Unfortunately, all code snippets and dumps are of no use. Please see http://gcc.gnu.org/bugs.html for the reason why. As an exercise, please compile *standalone* _preprocessed_ source you will create with -S added to your

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru
--- Comment #4 from vvv at ru dot ru 2009-04-28 17:15 --- Created an attachment (id=1) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view) Simple example from Linux See two functions: static void pre_schedule_rt and static void switched_from_rt -- http://gcc.gnu.o

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru
--- Comment #3 from vvv at ru dot ru 2009-04-28 17:10 --- Additional examples from Linux Kernel 2.6.29.1: (Note: conditional statement at the end of all functions!) = linux/drivers/video/console/bitblit.c void fbcon_set_bitops(struct fbcon_ops *ops) { ops->bmove

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru
--- Comment #2 from vvv at ru dot ru 2009-04-28 17:04 --- Created an attachment (id=17776) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view) Source file from Linux Kernel 2.6.29.1 See static void set_blitting_type -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=3

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread pinskia at gcc dot gnu dot org
--- Comment #1 from pinskia at gcc dot gnu dot org 2009-04-28 13:42 --- Can you provide the preprocessed source which contains set_blitting_type? -- pinskia at gcc dot gnu dot org changed: What|Removed |Added ---