https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #53
--- Comment #51 from jakub at gcc dot gnu dot org 2009-05-21 13:21 ---
Subject: Bug 39942
Author: jakub
Date: Thu May 21 13:21:30 2009
New Revision: 147765
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147765
Log:
PR target/39942
* config/i386/x86-64.h
--- Comment #52 from jakub at gcc dot gnu dot org 2009-05-21 13:26 ---
Subject: Bug 39942
Author: jakub
Date: Thu May 21 13:26:13 2009
New Revision: 147766
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147766
Log:
PR target/39942
* config/i386/x86-64.h
--- Comment #49 from vvv at ru dot ru 2009-05-20 21:38 ---
(In reply to comment #48)
How do these patches work? Do they require any special options?
# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
if (i == 1) F(1);
if
--- Comment #50 from jakub at gcc dot gnu dot org 2009-05-20 22:09 ---
nopl 0x0(%rax,%rax,1) and nopw 0x0(%rax,%rax,1) aren't padding (though, it
has been added in this case for label alignment or function entry alignment,
not to avoid 4+ jumps in one 16byte page)?
Anyway, you want
--- Comment #48 from hjl at gcc dot gnu dot org 2009-05-18 17:21 ---
Subject: Bug 39942
Author: hjl
Date: Mon May 18 17:21:13 2009
New Revision: 147671
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147671
Log:
2009-05-18 H.J. Lu hongjiu...@intel.com
PR target/39942
--- Comment #47 from jakub at gcc dot gnu dot org 2009-05-16 07:12 ---
Subject: Bug 39942
Author: jakub
Date: Sat May 16 07:12:02 2009
New Revision: 147607
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147607
Log:
PR target/39942
* final.c (label_to_max_skip):
--- Comment #37 from jakub at gcc dot gnu dot org 2009-05-15 07:56 ---
This patch looks very wrong. It assumes that min_insn_size gives exact insn
sizes (current min_insn_size is very far from that, but even get_attr_length
isn't exact), doesn't take into account label alignments nor
--- Comment #38 from jakub at gcc dot gnu dot org 2009-05-15 12:11 ---
To extend #c31, I've also built the same tree with another patch which made
sure ix86_avoid_jump_mispredicts is never called (change optimize into
optimize 4 in ix86_reorg). cc1plus sizes were then
0x88d6d8 bytes
--- Comment #39 from jakub at gcc dot gnu dot org 2009-05-15 12:12 ---
Created an attachment (id=17874)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17874&action=view)
test4jmp.sh
--- Comment #40 from hjl dot tools at gmail dot com 2009-05-15 14:35 ---
(In reply to comment #37)
> This patch looks very wrong. It assumes that min_insn_size gives exact insn
> sizes (current min_insn_size is very far from that, but even get_attr_length
> isn't exact), doesn't take
--- Comment #41 from jakub at gcc dot gnu dot org 2009-05-15 16:24 ---
The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me
look at one of the cases which was wrong and the problem is that cmp $0x1d, %al
has too large a get_attr_length (insn) returned, 3 instead
--- Comment #42 from jakub at gcc dot gnu dot org 2009-05-15 18:18 ---
Sizes with the #c41 patch together with the 3 patches mentioned in #c31 are:
0x890038 (64-bit) and 0x8ce08c (32-bit), 44 bad 16-byte pages in 64-bit, 35 in
32-bit.
--
--- Comment #43 from jakub at gcc dot gnu dot org 2009-05-15 18:23 ---
Some code size growth is from enlarged get_attr_modrm though, 292 bytes for
64-bit, 1338 bytes for 32-bit.
--- Comment #44 from hjl dot tools at gmail dot com 2009-05-15 23:05 ---
(In reply to comment #41)
> The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me
> look at one of the cases which was wrong and the problem is that cmp $0x1d, %al
> has too large
--- Comment #30 from vvv at ru dot ru 2009-05-14 09:01 ---
Created an attachment (id=17863)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.
Here are the results of my testing.
Code:
align 128
test_cikl:
rept 14 ; 14 if SH=0, 15 if SH=1,
--- Comment #31 from jakub at gcc dot gnu dot org 2009-05-14 15:15 ---
Some -O2 code size data from today's trunk bootstraps. The first .text line
is always vanilla bootstrap, the second one with
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
only, the third one with that
--- Comment #32 from hjl dot tools at gmail dot com 2009-05-14 15:58 ---
(In reply to comment #30)
> Created an attachment (id=17863)
> -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
> Testing tool.
Please make sure that you only test nop paddings for branch
--- Comment #33 from hjl dot tools at gmail dot com 2009-05-14 18:37 ---
(In reply to comment #20)
> Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
> very very likely refers to a cache-line.
That is true. For Intel CPUs, a 16-byte chunk means memory range
--- Comment #34 from vvv at ru dot ru 2009-05-14 19:43 ---
(In reply to comment #32)
> Please make sure that you only test nop paddings for branch insns,
> not nop paddings for branch targets, which prefer 16byte alignment.
Additional test results (for Core2):
1. Execution time doesn't
--- Comment #18 from jakub at gcc dot gnu dot org 2009-05-13 08:30 ---
No, .p2align is the right thing to do, given that GCC doesn't have 100%
accurate information about instruction sizes (for e.g. inline asms it can't
have, for
stuff where branch shortening can decrease the size
--- Comment #19 from vvv at ru dot ru 2009-05-13 11:42 ---
(In reply to comment #18)
> No, .p2align is the right thing to do, given that GCC doesn't have 100%
> accurate information about instruction sizes (for e.g. inline asms it can't
> have, for
> stuff where branch shortening can
--- Comment #20 from rguenth at gcc dot gnu dot org 2009-05-13 13:31 ---
Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
very very likely refers to a cache-line.
--- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 ---
I guess! Your patch is absolutely correct for AMD Athlon™ 64 and AMD Opteron™
processors, but it is suboptimal for Intel processors, because:
1. The AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but
Intel
--- Comment #22 from ubizjak at gmail dot com 2009-05-13 18:22 ---
(In reply to comment #21)
> I guess! Your patch is absolutely correct for AMD Athlon™ 64 and AMD
> Opteron™
> processors, but it is nonoptimal for Intel processors. Because:
> ...
CCing H.J for Intel optimization issues.
--- Comment #23 from rguenth at gcc dot gnu dot org 2009-05-13 18:34 ---
Note that we need something that works for the generic model as well, which in
this case looks like it is the same as for AMD models.
--- Comment #24 from hjl dot tools at gmail dot com 2009-05-13 18:45 ---
Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
it will increase code size.
--- Comment #25 from vvv at ru dot ru 2009-05-13 18:56 ---
(In reply to comment #22)
> CCing H.J for Intel optimization issues.
VVV 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF), but
VVV Intel limitation for 16-bytes chunk (memory range - +10h)
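To make the distinction concrete: as described above, one model counts branches per aligned 16-byte block (addresses XXX0 - XXXF), the other counts branches in any 16 consecutive bytes. The following is only a rough sketch of the two counting rules, not code from GCC or from any vendor manual. The function names are made up for illustration, and it simplifies by looking only at the start address of each branch, ignoring instruction lengths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest number of branches whose start addresses
   fall into one aligned 16-byte block (XXX0..XXXF).
   `addr` must be sorted in ascending order. */
int max_branches_aligned(const unsigned long *addr, size_t n)
{
    int best = 0, run = 0;
    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || addr[i - 1] <= addr[i]);   /* input must be sorted */
        if (i > 0 && (addr[i] >> 4) == (addr[i - 1] >> 4))
            run++;          /* same aligned block as the previous branch */
        else
            run = 1;        /* first branch in a new aligned block */
        if (run > best)
            best = run;
    }
    return best;
}

/* Hypothetical helper: largest number of branches whose start addresses
   fall into any window [a, a + 16), regardless of alignment. */
int max_branches_sliding(const unsigned long *addr, size_t n)
{
    int best = 0;
    size_t lo = 0;
    for (size_t hi = 0; hi < n; hi++) {
        /* Drop branches that are 16 or more bytes behind the current one. */
        while (addr[hi] - addr[lo] >= 16)
            lo++;
        int count = (int)(hi - lo + 1);
        if (count > best)
            best = count;
    }
    return best;
}
```

For branches starting at 0x0c, 0x0e, 0x10 and 0x12, the aligned rule sees at most 2 branches per block, while the sliding rule sees all 4 within 16 bytes. So padding chosen for one model is not necessarily right for the other.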
--- Comment #26 from vvv at ru dot ru 2009-05-13 19:05 ---
(In reply to comment #23)
> Note that we need something that works for the generic model as well, which in
> this case looks like it is the same as for AMD models.
There is a processor property TARGET_FOUR_JUMP_LIMIT; maybe create
--- Comment #27 from jakub at gcc dot gnu dot org 2009-05-13 19:08 ---
If inserting the padding isn't worth it for say core2, m_CORE2 could be dropped
from X86_TUNE_FOUR_JUMP_LIMIT, but certainly it would be interesting to see
SPEC numbers backing that up. Similarly for AMD CPUs, and
--- Comment #28 from vvv at ru dot ru 2009-05-13 19:18 ---
(In reply to comment #24)
> Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
> it will increase code size.
A single one-byte NOP per 16-byte chunk is enough for padding. But, IMHO, four
branches in 16
--- Comment #29 from hjl dot tools at gmail dot com 2009-05-13 21:44 ---
Created an attachment (id=17858)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17858&action=view)
Impact of X86_TUNE_FOUR_JUMP_LIMIT on SPEC CPU 2K
This is my old data of X86_TUNE_FOUR_JUMP_LIMIT on Penryn
--- Comment #17 from vvv at ru dot ru 2009-05-12 16:40 ---
(In reply to comment #16)
> Created an attachment (id=17783)
> -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
> gcc45-pr39942.patch
> Patch that attempts to take into account .p2align directives that are
--- Comment #16 from jakub at gcc dot gnu dot org 2009-04-30 09:07 ---
Created an attachment (id=17783)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
gcc45-pr39942.patch
Patch that attempts to take into account .p2align directives that are emitted
for (some)
--- Comment #11 from vvv at ru dot ru 2009-04-29 07:46 ---
(In reply to comment #8)
> From config/i386/i386.c:
> /* AMD Athlon works faster
> when RET is not destination of conditional jump or directly preceded
> by other jump instruction. We avoid the penalty by inserting NOP just
--- Comment #12 from vvv at ru dot ru 2009-04-29 07:55 ---
(In reply to comment #9)
> So that explains it. Use -Os or attribute cold if you want NOPs to be gone.
But my measurements on Core 2 Duo P8600 show that
push %ebp
mov %esp,%ebp
leave
ret
is _faster_ than
push %ebp
mov
--- Comment #13 from jakub at gcc dot gnu dot org 2009-04-29 09:32 ---
You are benchmarking something completely unrelated.
What really matters is how code that has 4 branches/calls in one 16-byte block
is able to predict all those branches. And Core2 similarly to various AMD CPUs
is
--- Comment #14 from jakub at gcc dot gnu dot org 2009-04-29 10:12 ---
Also, couldn't we use the information computed by compute_alignments and
assume CODE_LABELs are aligned?
Probably would need to add label_to_max_skip (rtx) function to final.c,
so that not just label_to_alignment,
--- Comment #15 from vvv at ru dot ru 2009-04-29 19:16 ---
One more example: a 5-byte nop between leaveq and retq.
# cat test.c
void wait_for_enter()
{
int u = getchar();
while (!u)
u = getchar()-13;
}
main()
{
wait_for_enter();
}
# gcc -o t.out test.c
--- Comment #1 from pinskia at gcc dot gnu dot org 2009-04-28 13:42 ---
Can you provide the preprocessed source which contains set_blitting_type?
--- Comment #2 from vvv at ru dot ru 2009-04-28 17:04 ---
Created an attachment (id=17776)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux Kernel 2.6.29.1
See static void set_blitting_type
--
--- Comment #3 from vvv at ru dot ru 2009-04-28 17:10 ---
Additional examples from Linux Kernel 2.6.29.1:
(Note: there is a conditional statement at the end of all functions!)
=
linux/drivers/video/console/bitblit.c
void fbcon_set_bitops(struct fbcon_ops *ops)
{
ops->bmove
--- Comment #4 from vvv at ru dot ru 2009-04-28 17:15 ---
Created an attachment (id=1)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view)
Simple example from Linux
See two functions:
static void pre_schedule_rt
static void switched_from_rt
--
--- Comment #5 from ubizjak at gmail dot com 2009-04-28 17:37 ---
Unfortunately, all code snippets and dumps are of no use. Please see
http://gcc.gnu.org/bugs.html for the reason why.
As an exercise, please compile *standalone* _preprocessed_ source you will
create with -S added to
--- Comment #7 from pinskia at gcc dot gnu dot org 2009-04-28 21:23 ---
4.1.2 produces:
.L4:
addq $8, %rsp
.p2align 4,,2
ret
While the trunk produces:
.L1:
addq $8, %rsp
.p2align 4,,2
.p2align 3
ret
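For readers less familiar with the directive: per the GNU assembler documentation, `.p2align 4,,2` aligns to a 16-byte boundary only if that needs at most 2 bytes of padding; if more would be needed, no alignment is done at all. A plain `.p2align 3` has no such limit. The sketch below models that rule in C. The function name p2align_padding and the convention that a negative max_skip means "no limit" are made up for illustration.

```c
#include <assert.h>

/* Hypothetical model of `.p2align log2_align, , max_skip`:
   returns how many padding bytes are emitted at location `addr`.
   A negative max_skip models an omitted third operand (no limit). */
unsigned p2align_padding(unsigned long addr, unsigned log2_align, int max_skip)
{
    assert(log2_align < 8 * sizeof addr);
    unsigned long align = 1UL << log2_align;
    /* Distance to the next multiple of `align` (0 if already aligned). */
    unsigned long pad = (align - (addr & (align - 1))) & (align - 1);
    if (max_skip >= 0 && pad > (unsigned long)max_skip)
        return 0;   /* too far away: the alignment is skipped entirely */
    return (unsigned)pad;
}
```

Under this model, at address 0x1e the first directive emits 2 bytes and the second emits none. At address 0x19 the first directive emits nothing (7 bytes would be needed), so the following `.p2align 3` is the one that pads, by 7 bytes.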
--
--- Comment #6 from vvv at ru dot ru 2009-04-28 21:18 ---
Let's compile file test.c
//#file test.c
extern int F(int m);
void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}
# gcc -o t.out test.c -c -O2
# objdump -d t.out
t.out: file format
--- Comment #8 from ubizjak at gmail dot com 2009-04-28 21:47 ---
From config/i386/i386.c:
/* AMD Athlon works faster
when RET is not destination of conditional jump or directly preceded
by other jump instruction. We avoid the penalty by inserting NOP just
before the RET
--- Comment #9 from pinskia at gcc dot gnu dot org 2009-04-28 21:52 ---
So that explains it. Use -Os or attribute cold if you want NOPs to be gone.
--- Comment #10 from ubizjak at gmail dot com 2009-04-28 21:53 ---
Actually, alignment is from ix86_avoid_jump_misspredicts, where:
/* Look for all minimal intervals of instructions containing 4 jumps.
The intervals are bounded by START and INSN. NBYTES is the total
size
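The quoted comment is cut off here, so what follows is not the code from i386.c. It is only a simplified sketch of the general idea of scanning for short intervals that contain four jumps. The struct, the function name and the "at most 16 bytes" threshold are assumptions made for illustration, and instruction sizes are taken as exact, which (as noted elsewhere in this bug) GCC cannot guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction record for this sketch. */
struct insn {
    int size;      /* encoded length in bytes (assumed exact here) */
    int is_jump;   /* nonzero for jumps and calls */
};

/* Returns the index of the first instruction that completes a run of four
   jumps spanning at most 16 bytes (measured from the start of the first
   jump to the end of the fourth), or -1 if there is no such run. */
long first_dense_jump_interval(const struct insn *insns, size_t n)
{
    long jump_start[4];   /* ring buffer: start offsets of the last four jumps */
    size_t njumps = 0;
    long offset = 0;      /* byte offset of the current instruction */

    for (size_t i = 0; i < n; i++) {
        assert(insns[i].size > 0);
        if (insns[i].is_jump) {
            jump_start[njumps % 4] = offset;
            njumps++;
            if (njumps >= 4) {
                /* After the increment, slot njumps % 4 holds the oldest
                   of the last four jumps. */
                long start = jump_start[njumps % 4];
                long nbytes = offset + insns[i].size - start;
                if (nbytes <= 16)
                    return (long)i;
            }
        }
        offset += insns[i].size;
    }
    return -1;
}
```

Four back-to-back 2-byte jumps are reported at index 3. If each pair of jumps is separated by a 5-byte non-jump instruction, the four jumps span 23 bytes and nothing is reported. A real pass would insert padding before the reported instruction instead of just returning its index.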