https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
Peter Cordes changed:
What    |Removed |Added
CC      |        |peter at cordes dot ca
--- Comment #53 fr
--- Comment #52 from jakub at gcc dot gnu dot org 2009-05-21 13:26 ---
Subject: Bug 39942
Author: jakub
Date: Thu May 21 13:26:13 2009
New Revision: 147766
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147766
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTP
--- Comment #51 from jakub at gcc dot gnu dot org 2009-05-21 13:21 ---
Subject: Bug 39942
Author: jakub
Date: Thu May 21 13:21:30 2009
New Revision: 147765
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147765
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTP
--- Comment #50 from jakub at gcc dot gnu dot org 2009-05-20 22:09 ---
nopl 0x0(%rax,%rax,1) and nopw 0x0(%rax,%rax,1) aren't padding (though, it
has been added in this case for label alignment or function entry alignment,
not to avoid 4+ jumps in one 16byte page)?
Anyway, you want t
--- Comment #49 from vvv at ru dot ru 2009-05-20 21:38 ---
(In reply to comment #48)
How do these patches work? Are any special options required?
# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
if (i == 1) F(1);
if
--- Comment #48 from hjl at gcc dot gnu dot org 2009-05-18 17:21 ---
Subject: Bug 39942
Author: hjl
Date: Mon May 18 17:21:13 2009
New Revision: 147671
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147671
Log:
2009-05-18 H.J. Lu
PR target/39942
* config/i386
--- Comment #47 from jakub at gcc dot gnu dot org 2009-05-16 07:12 ---
Subject: Bug 39942
Author: jakub
Date: Sat May 16 07:12:02 2009
New Revision: 147607
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147607
Log:
PR target/39942
* final.c (label_to_max_skip): N
--- Comment #46 from jakub at gcc dot gnu dot org 2009-05-16 07:10 ---
Subject: Bug 39942
Author: jakub
Date: Sat May 16 07:09:52 2009
New Revision: 147606
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147606
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTP
--- Comment #45 from jakub at gcc dot gnu dot org 2009-05-16 06:37 ---
cmpl $1, %eax does have the modrm byte:
83 f8 01 cmp $0x1,%eax
compared to cmpl $0xdeadbeef, %eax which doesn't have it:
3d ef be ad de cmp $0xdeadbeef,%eax
So I think what I wrote is more prec
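The length difference described above can be sketched in C. This is only an illustration built from the two encodings quoted in the comment; the helper name cmp_eax_imm_length is hypothetical and is not part of GCC, and it assumes the assembler picks the short imm8 form whenever the immediate fits in a signed byte.

```c
#include <stddef.h>

/* Byte sequences quoted in the comment above (AT&T syntax). */
static const unsigned char cmp_imm8[] = {
    0x83, 0xf8, 0x01              /* opcode, ModRM (f8), imm8: cmp $0x1,%eax */
};
static const unsigned char cmp_imm32[] = {
    0x3d, 0xef, 0xbe, 0xad, 0xde  /* opcode, imm32, no ModRM: cmp $0xdeadbeef,%eax */
};

/* Hypothetical helper: encoded length of "cmp $imm, %eax", assuming the
   assembler uses the imm8 form when imm fits in a signed byte and the
   accumulator-specific imm32 form otherwise. */
size_t cmp_eax_imm_length(long long imm)
{
    if (imm >= -128 && imm <= 127)
        return sizeof cmp_imm8;   /* 3 bytes: has a ModRM byte */
    return sizeof cmp_imm32;      /* 5 bytes: no ModRM byte */
}
```

The shorter instruction is the one that carries the ModRM byte, so a length estimate that adds a byte for ModRM must not also assume a 4-byte immediate.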
--- Comment #44 from hjl dot tools at gmail dot com 2009-05-15 23:05 ---
(In reply to comment #41)
> The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me
> look at one of the cases which was wrong and the problem is that cmp $0x1d,
> %al
> has too large get_at
--- Comment #43 from jakub at gcc dot gnu dot org 2009-05-15 18:23 ---
Some code size growth is from enlarged get_attr_modrm though, 292 bytes for
64-bit, 1338 bytes for 32-bit.
--- Comment #42 from jakub at gcc dot gnu dot org 2009-05-15 18:18 ---
Sizes with the #c41 patch together with the 3 patches mentioned in #c31 are:
0x890038 (64-bit) and 0x8ce08c (32-bit), 44 bad 16-byte pages in 64-bit, 35 in
32-bit.
--- Comment #41 from jakub at gcc dot gnu dot org 2009-05-15 16:24 ---
The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me
look at one of the cases which was wrong and the problem is that cmp $0x1d, %al
has too large get_attr_length (insn) returned, 3 instead o
--- Comment #40 from hjl dot tools at gmail dot com 2009-05-15 14:35 ---
(In reply to comment #37)
> This patch looks very wrong. It assumes that min_insn_size gives exact insn
> sizes (current min_insn_size is very far from that, but even get_attr_length
> isn't exact), doesn't take i
--- Comment #39 from jakub at gcc dot gnu dot org 2009-05-15 12:12 ---
Created an attachment (id=17874)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17874&action=view)
test4jmp.sh
--- Comment #38 from jakub at gcc dot gnu dot org 2009-05-15 12:11 ---
To extend #c31, I've also built the same tree with another patch which made
sure ix86_avoid_jump_mispredicts is never called (change "&& optimize" into "&&
optimize > 4" in ix86_reorg). cc1plus sizes were then
0x88d6
--- Comment #37 from jakub at gcc dot gnu dot org 2009-05-15 07:56 ---
This patch looks very wrong. It assumes that min_insn_size gives exact insn
sizes (current min_insn_size is very far from that, but even get_attr_length
isn't exact), doesn't take into account label alignments nor br
--- Comment #36 from hjl dot tools at gmail dot com 2009-05-15 04:32 ---
Created an attachment (id=17871)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17871&action=view)
An updated patch
A few comments:
1. 3 branch limit is per 16byte page, not 16byte window.
2. We should allow
--- Comment #35 from hjl dot tools at gmail dot com 2009-05-15 02:23 ---
Created an attachment (id=17870)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17870&action=view)
A patch
This patch limits 3 branches per 16byte page.
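A minimal sketch of what "3 branches per 16-byte page" means, as opposed to a sliding 16-byte window. This is not the code from the attached patch; the function name exceeds_branch_limit is hypothetical, and it assumes a branch is attributed to the page containing its first byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if any 16-byte aligned page (addresses XXX0 - XXXF) holds more
   than `limit` branches, 0 otherwise.
   Assumptions: branch_addrs is sorted in ascending order, and each branch
   is counted in the page that contains its start address. */
int exceeds_branch_limit(const uint64_t *branch_addrs, size_t n, unsigned limit)
{
    uint64_t page = 0;
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t p = branch_addrs[i] >> 4;   /* index of the 16-byte page */
        if (i == 0 || p != page) {           /* entered a new page */
            page = p;
            count = 0;
        }
        if (++count > limit)
            return 1;
    }
    return 0;
}
```

With a limit of 3, branches at 0x100, 0x102, 0x104 and 0x106 exceed the limit, while branches at 0x10a, 0x10c, 0x10e and 0x110 do not, even though the latter four fit inside one 16-byte window.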
--- Comment #34 from vvv at ru dot ru 2009-05-14 19:43 ---
(In reply to comment #32)
> Please make sure that you only test nop paddings for branch insns,
> not nop paddings for branch targets, which prefer 16byte alignment.
Additional test results (for Core2):
1. Execution time don't d
--- Comment #33 from hjl dot tools at gmail dot com 2009-05-14 18:37 ---
(In reply to comment #20)
> Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
> very very likely refers to a cache-line.
>
That is true. For Intel CPUs, "16-bytes chunk" means memory r
--- Comment #32 from hjl dot tools at gmail dot com 2009-05-14 15:58 ---
(In reply to comment #30)
> Created an attachment (id=17863)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view) [edit]
> Testing tool.
>
Please make sure that you only test nop paddings for br
--- Comment #31 from jakub at gcc dot gnu dot org 2009-05-14 15:15 ---
Some -O2 code size data from today's trunk bootstraps. The first .text line
is always vanilla bootstrap, the second one with
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
only, the third one with that patch
--- Comment #30 from vvv at ru dot ru 2009-05-14 09:01 ---
Created an attachment (id=17863)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.
Here are the results of my testing.
Code:
align 128
test_cikl:
rept 14 ; 14 if SH=0, 15 if SH=1,
--- Comment #29 from hjl dot tools at gmail dot com 2009-05-13 21:44 ---
Created an attachment (id=17858)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17858&action=view)
Impact of X86_TUNE_FOUR_JUMP_LIMIT on SPEC CPU 2K
This is my old data of X86_TUNE_FOUR_JUMP_LIMIT on Penryn a
--- Comment #28 from vvv at ru dot ru 2009-05-13 19:18 ---
(In reply to comment #24)
> Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
> it will increase code size.
A single one-byte NOP per 16-byte chunk is enough for padding. But, IMHO, four
branches in 16
--- Comment #27 from jakub at gcc dot gnu dot org 2009-05-13 19:08 ---
If inserting the padding isn't worth it for say core2, m_CORE2 could be dropped
from X86_TUNE_FOUR_JUMP_LIMIT, but certainly it would be interesting to see
SPEC numbers backing that up. Similarly for AMD CPUs, and if
--- Comment #26 from vvv at ru dot ru 2009-05-13 19:05 ---
(In reply to comment #23)
> Note that we need something that works for the generic model as well, which in
> this case looks like it is the same as for AMD models.
There is processor property TARGET_FOUR_JUMP_LIMIT, may be creat
--- Comment #25 from vvv at ru dot ru 2009-05-13 18:56 ---
(In reply to comment #22)
> CCing H.J for Intel optimization issues.
VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF), but
VVV> Intel limitation for 16-bytes chunk (memory range -
+10
--- Comment #24 from hjl dot tools at gmail dot com 2009-05-13 18:45 ---
Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
it will increase code size.
--- Comment #23 from rguenth at gcc dot gnu dot org 2009-05-13 18:34 ---
Note that we need something that works for the generic model as well, which in
this case looks like it is the same as for AMD models.
--- Comment #22 from ubizjak at gmail dot com 2009-05-13 18:22 ---
(In reply to comment #21)
> I guess! Your patch is absolutely correct for AMD Athlon™ 64 and AMD
> Opteron™
> processors, but it is nonoptimal for Intel processors. Because:
...
CCing H.J for Intel optimization issue
--- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 ---
I guess! Your patch is absolutely correct for AMD Athlon™ 64 and AMD Opteron™
processors, but it is nonoptimal for Intel processors. Because:
1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF), but
Intel li
--- Comment #20 from rguenth at gcc dot gnu dot org 2009-05-13 13:31 ---
Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
very very likely refers to a cache-line.
--- Comment #19 from vvv at ru dot ru 2009-05-13 11:42 ---
(In reply to comment #18)
> No, .p2align is the right thing to do, given that GCC doesn't have 100%
> accurate information about instruction sizes (for e.g. inline asms it can't
> have, for
> stuff where branch shortening can dec
--- Comment #18 from jakub at gcc dot gnu dot org 2009-05-13 08:30 ---
No, .p2align is the right thing to do, given that GCC doesn't have 100%
accurate information about instruction sizes (for e.g. inline asms it can't
have, for
stuff where branch shortening can decrease the size doesn't
--- Comment #17 from vvv at ru dot ru 2009-05-12 16:40 ---
(In reply to comment #16)
> Created an attachment (id=17783)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view) [edit]
> gcc45-pr39942.patch
> Patch that attempts to take into account .p2align directives that
--- Comment #16 from jakub at gcc dot gnu dot org 2009-04-30 09:07 ---
Created an attachment (id=17783)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
gcc45-pr39942.patch
Patch that attempts to take into account .p2align directives that are emitted
for (some) COD
--- Comment #15 from vvv at ru dot ru 2009-04-29 19:16 ---
One more example: a 5-byte nop between leaveq and retq.
# cat test.c
void wait_for_enter()
{
int u = getchar();
while (!u)
u = getchar()-13;
}
main()
{
wait_for_enter();
}
# gcc -o t.out test.c -O
--- Comment #14 from jakub at gcc dot gnu dot org 2009-04-29 10:12 ---
Also, couldn't we use the information computed by compute_alignments and
assume CODE_LABELs are aligned?
Probably would need to add label_to_max_skip (rtx) function to final.c,
so that not just label_to_alignment, but
--- Comment #13 from jakub at gcc dot gnu dot org 2009-04-29 09:32 ---
You are benchmarking something completely unrelated.
What really matters is how code that has 4 branches/calls in one 16-byte block
is able to predict all those branches. And Core2 similarly to various AMD CPUs
is no
--- Comment #12 from vvv at ru dot ru 2009-04-29 07:55 ---
(In reply to comment #9)
> So that explains it. Use -Os or attribute cold if you want the NOPs to be gone.
But my measurements on Core 2 Duo P8600 show that
push %ebp
mov %esp,%ebp
leave
ret
is _faster_ than
push %ebp
mov %esp,%eb
--- Comment #11 from vvv at ru dot ru 2009-04-29 07:46 ---
(In reply to comment #8)
> From config/i386/i386.c:
> /* AMD Athlon works faster
> when RET is not destination of conditional jump or directly preceded
> by other jump instruction. We avoid the penalty by inserting NOP jus
--- Comment #10 from ubizjak at gmail dot com 2009-04-28 21:53 ---
Actually, alignment is from ix86_avoid_jump_misspredicts, where:
/* Look for all minimal intervals of instructions containing 4 jumps.
The intervals are bounded by START and INSN. NBYTES is the total
size of
--- Comment #9 from pinskia at gcc dot gnu dot org 2009-04-28 21:52 ---
So that explains it. Use -Os or attribute cold if you want the NOPs to be gone.
--
pinskia at gcc dot gnu dot org changed:
What|Removed |Added
-
--- Comment #8 from ubizjak at gmail dot com 2009-04-28 21:47 ---
From config/i386/i386.c:
/* AMD Athlon works faster
when RET is not destination of conditional jump or directly preceded
by other jump instruction. We avoid the penalty by inserting NOP just
before the RET inst
--- Comment #6 from vvv at ru dot ru 2009-04-28 21:18 ---
Let's compile file test.c
//#file test.c
extern int F(int m);
void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}
# gcc -o t.out test.c -c -O2
# objdump -d t.out
t.out: file format e
--- Comment #7 from pinskia at gcc dot gnu dot org 2009-04-28 21:23 ---
4.1.2 produces:
.L4:
addq $8, %rsp
.p2align 4,,2
ret
While the trunk produces:
.L1:
addq $8, %rsp
.p2align 4,,2
.p2align 3
ret
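To see why the extra .p2align 3 matters, here is a rough C model of how much padding one GNU as .p2align directive emits. The function name p2align_padding is hypothetical, and passing 0 for max_skip is this sketch's own way of saying the third operand was omitted (no limit).

```c
#include <stdint.h>

/* Bytes of padding that ".p2align log2_align,,max_skip" would emit when the
   location counter is at addr. If more than max_skip bytes are needed, the
   directive emits nothing. max_skip == 0 stands for "operand omitted",
   i.e. always pad (a convention of this sketch only). */
uint64_t p2align_padding(uint64_t addr, unsigned log2_align, uint64_t max_skip)
{
    uint64_t align = (uint64_t)1 << log2_align;
    uint64_t pad = (align - (addr & (align - 1))) & (align - 1);

    if (max_skip != 0 && pad > max_skip)
        return 0;   /* too much padding needed: skip the alignment */
    return pad;
}
```

For example, with the ret at offset 0x1b, .p2align 4,,2 emits nothing because 5 bytes would be needed, but the following .p2align 3 then pads 5 bytes up to 0x20. In the 4.1.2 output, with only the first directive, no padding would be emitted at that offset.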
--- Comment #5 from ubizjak at gmail dot com 2009-04-28 17:37 ---
Unfortunately, all code snippets and dumps are of no use. Please see
http://gcc.gnu.org/bugs.html for the reason why.
As an exercise, please compile *standalone* _preprocessed_ source you will
create with -S added to your
--- Comment #4 from vvv at ru dot ru 2009-04-28 17:15 ---
Created an attachment (id=1)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view)
Simple example from Linux
See two functions:
static void pre_schedule_rt
static void switched_from_rt
--- Comment #3 from vvv at ru dot ru 2009-04-28 17:10 ---
Additional examples from Linux Kernel 2.6.29.1:
(Note: conditional statement at the end of all functions!)
=
linux/drivers/video/console/bitblit.c
void fbcon_set_bitops(struct fbcon_ops *ops)
{
ops->bmove
--- Comment #2 from vvv at ru dot ru 2009-04-28 17:04 ---
Created an attachment (id=17776)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux Kernel 2.6.29.1
See static void set_blitting_type
--- Comment #1 from pinskia at gcc dot gnu dot org 2009-04-28 13:42 ---
Can you provide the preprocessed source which contains set_blitting_type?
--
pinskia at gcc dot gnu dot org changed:
What|Removed |Added
---