--
Summary: Nonoptimal save/restore registers
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru
http
--- Comment #4 from vvv at ru dot ru 2009-05-25 19:54 ---
(In reply to comment #2)
This is very odd? What is the assembler doing that the compiler isn't?
Some optimizations are impossible without exact knowledge of
addresses and opcodes.
One example: avoiding branch
--- Comment #49 from vvv at ru dot ru 2009-05-20 21:38 ---
(In reply to comment #48)
How do these patches work? Are any special options required?
# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
if (i == 1) F(1
!
Product: gcc
Version: 4.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40171
--- Comment #30 from vvv at ru dot ru 2009-05-14 09:01 ---
Created an attachment (id=17863)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.
Here are the results of my testing.
Code:
align 128
test_cikl:
rept 14 ; 14 if SH=0, 15 if SH=1
--- Comment #34 from vvv at ru dot ru 2009-05-14 19:43 ---
(In reply to comment #32)
Please make sure that you only test nop paddings for branch insns,
not nop paddings for branch targets, which prefer 16byte alignment.
Additional tests (for Core2) results:
1. Execution time doesn't
--- Comment #19 from vvv at ru dot ru 2009-05-13 11:42 ---
(In reply to comment #18)
No, .p2align is the right thing to do, given that GCC doesn't have 100%
accurate information about instruction sizes (for e.g. inline asms it can't
have, for
stuff where branch shortening can
--- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 ---
I guess! Your patch is absolutely correct for AMD Athlon(TM) 64 and AMD Opteron(TM)
processors, but it is not optimal for Intel processors, because:
1. The AMD limitation applies to a 16-byte page (memory range XXX0 - XXXF), but
Intel
--- Comment #25 from vvv at ru dot ru 2009-05-13 18:56 ---
(In reply to comment #22)
CCing H.J for Intel optimization issues.
VVV 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF),
but
VVV Intel limitation for 16-bytes chunk (memory range -
+10h
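The AMD "page" versus Intel "chunk" distinction quoted above can be sketched in C. This is only an illustration, not code from GCC or from the patch under discussion. It assumes that "page" means a 16-byte aligned block (XXX0 - XXXF), that "chunk" means any 16 consecutive bytes, and that a branch is counted by the address of its first byte. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a GCC function).
   Returns 1 if some 16-byte ALIGNED block (addresses XXX0 - XXXF)
   contains at least `limit` of the given branch addresses.
   `addrs` must be sorted in ascending order.
   Returns 0 when limit is 0 or larger than n. */
int aligned_block_has_n_branches(const uint32_t *addrs, size_t n, size_t limit)
{
    if (limit == 0)
        return 0;
    for (size_t i = 0; i + limit <= n; i++) {
        /* Sorted input: if the first and last of `limit` consecutive
           entries share a block, every entry between them does too. */
        if ((addrs[i] >> 4) == (addrs[i + limit - 1] >> 4))
            return 1;
    }
    return 0;
}

/* Hypothetical helper (not a GCC function).
   Returns 1 if ANY window of 16 consecutive bytes, aligned or not,
   contains at least `limit` of the given branch addresses.
   `addrs` must be sorted in ascending order.
   Returns 0 when limit is 0 or larger than n. */
int sliding_window_has_n_branches(const uint32_t *addrs, size_t n, size_t limit)
{
    if (limit == 0)
        return 0;
    for (size_t i = 0; i + limit <= n; i++) {
        /* `limit` entries fit in one 16-byte window when the distance
           between the first and the last is below 16. */
        if (addrs[i + limit - 1] - addrs[i] < 16)
            return 1;
    }
    return 0;
}
```

With branches at 0x0E, 0x10, 0x12 and 0x14 and a limit of four, the aligned check reports nothing (only three branches fall in the block 0x10 - 0x1F), while the sliding check does report a hit. Under these assumptions, padding that satisfies the aligned rule may still leave four branches in one unaligned 16-byte window.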
--- Comment #26 from vvv at ru dot ru 2009-05-13 19:05 ---
(In reply to comment #23)
Note that we need something that works for the generic model as well, which in
this case looks like it is the same as for AMD models.
There is a processor property TARGET_FOUR_JUMP_LIMIT; maybe create
--- Comment #28 from vvv at ru dot ru 2009-05-13 19:18 ---
(In reply to comment #24)
Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
it will increase code size.
A single one-byte NOP per 16-byte chunk is enough for padding. But, IMHO, four
branches in 16
--- Comment #17 from vvv at ru dot ru 2009-05-12 16:40 ---
(In reply to comment #16)
Created an attachment (id=17783)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
gcc45-pr39942.patch
Patch that attempts to take into account .p2align directives
ReportedBy: vvv at ru dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40093
--- Comment #1 from vvv at ru dot ru 2009-05-10 16:43 ---
Created an attachment (id=17847)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17847&action=view)
Example direct/inverse calls
Simple example. RDTSC ticks for direct and inverse sequence of calls.
--
http://gcc.gnu.org
--- Comment #3 from vvv at ru dot ru 2009-05-10 18:08 ---
(In reply to comment #2)
This should have been done already with cgraph order.
Unfortunately, I can see the inverse order only in a separate source file. It is
inverse, but not optimized.
Example:
// file order1.c
#include <stdio.h>
main(int
--- Comment #5 from vvv at ru dot ru 2009-05-10 18:20 ---
(In reply to comment #4)
Well you need whole program to get the behavior which you want.
Yes. Of course, it's no problem for a small single-programmer project, but it's
a problem for big projects like the Linux Kernel.
--
http
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40072
--- Comment #11 from vvv at ru dot ru 2009-04-29 07:46 ---
(In reply to comment #8)
From config/i386/i386.c:
/* AMD Athlon works faster
when RET is not destination of conditional jump or directly preceded
by other jump instruction. We avoid the penalty by inserting NOP just
--- Comment #12 from vvv at ru dot ru 2009-04-29 07:55 ---
(In reply to comment #9)
So that explains it, Use -Os or attribute cold if you want NOPs to be gone.
But my measurements on Core 2 Duo P8600 show that
push %ebp
mov %esp,%ebp
leave
ret
is _faster_ than
push %ebp
mov %esp
--- Comment #15 from vvv at ru dot ru 2009-04-29 19:16 ---
One more example: a 5-byte nop between leaveq and retq.
# cat test.c
void wait_for_enter()
{
int u = getchar();
while (!u)
u = getchar()-13;
}
main()
{
wait_for_enter();
}
# gcc -o t.out test.c
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
--- Comment #2 from vvv at ru dot ru 2009-04-28 17:04 ---
Created an attachment (id=17776)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux Kernel 2.6.29.1
See static void set_blitting_type
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id
--- Comment #3 from vvv at ru dot ru 2009-04-28 17:10 ---
Additional examples from Linux Kernel 2.6.29.1:
(Note: there is a conditional statement at the end of every function!)
=
linux/drivers/video/console/bitblit.c
void fbcon_set_bitops(struct fbcon_ops *ops)
{
ops->bmove
--- Comment #4 from vvv at ru dot ru 2009-04-28 17:15 ---
Created an attachment (id=1)
-- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view)
Simple example from Linux
See two functions:
static void pre_schedule_rt
static void switched_from_rt
--
http
--- Comment #6 from vvv at ru dot ru 2009-04-28 21:18 ---
Let's compile the file test.c
//#file test.c
extern int F(int m);
void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}
# gcc -o t.out test.c -c -O2
# objdump -d t.out
t.out: file format
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39549
ReportedBy: vvv at ru dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39520