[Bug target/36133] GCC creates suboptimal ASM : Code includes unneeded TST instructions

2008-11-10 Thread gunnar at greyhound-data dot com


--- Comment #8 from gunnar at greyhound-data dot com  2008-11-10 12:54 
---
(In reply to comment #7)
> (In reply to comment #4)
> > There are two causes where GCC generates unneeded TST instructions.
> > A) General arithmetic 
> >  lsr.l #1,D0 
> >  tst.l d0
> >  jbne ...
> >
> > This tst instruction is unneeded as the LSR is setting the flags correctly
> > already.
> 
> This is NOT correct. LSL does write to the condition codes, but not all of it.
> In particular, the bit involved in the not-equal test is not set.
> 
> This TST *is* required.

What you say is not correct.

The bit involved in the not-equal test is the Z bit.
Both LSR and TST set the Z bit in exactly the same way: it is set when the
result is zero and cleared otherwise.

The TST instruction in the example is 100% unneeded and NOT required.

Please check the official 68K documentation for which flags the conditional
branch instruction Bcc tests (in this case BNE), and verify the behavior of TST
and LSR with regard to setting these bits.
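
The argument above can be sketched in C. This is only a rough model, assuming
that nothing but the Z bit matters for BNE; the type and function names
(flag_result, model_lsr_l, model_tst_l, bne_taken) are made up for this
illustration, and the N, C, V and X bits are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: register contents plus the Z bit only. */
typedef struct {
    uint32_t value; /* register contents after the instruction */
    bool z;         /* Z bit: set when the result is zero */
} flag_result;

/* lsr.l #count,Dn with an immediate count of 1..8 */
static flag_result model_lsr_l(uint32_t d, unsigned count)
{
    flag_result r;
    r.value = d >> count;
    r.z = (r.value == 0);
    return r;
}

/* tst.l Dn: the register is unchanged, Z reflects its contents */
static flag_result model_tst_l(uint32_t d)
{
    flag_result r;
    r.value = d;
    r.z = (d == 0);
    return r;
}

/* BNE branches when Z is clear. */
static bool bne_taken(flag_result r)
{
    return !r.z;
}

/* The branch decision is the same with or without the extra tst.l. */
static void check_tst_is_redundant(uint32_t d, unsigned count)
{
    flag_result after_shift = model_lsr_l(d, count);
    flag_result after_tst = model_tst_l(after_shift.value);
    assert(bne_taken(after_shift) == bne_taken(after_tst));
}
```

Calling check_tst_is_redundant() for any register value and shift count passes
in this model, because both instructions derive Z from the same value.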

Best Regards
Gunnar von Boehn


-- 

gunnar at greyhound-data dot com changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36133



[Bug middle-end/36770] PowerPC generated PTR code inefficiency

2008-07-10 Thread gunnar at greyhound-data dot com


--- Comment #2 from gunnar at greyhound-data dot com  2008-07-10 09:18 
---
(In reply to comment #1)
> forward-propagate is causing some of the issues as shown by:
> int *test2(int *a ){
>   a[1]=a[0];
>   a++;
>   return a;
> }

Your example creates the following ASM code:
test2:
mr 9,3
addi 3,3,4
lwz 0,0(9)
stw 0,4(9)
blr

Correct would be:
test2:
lwz 0,0(3)
stwu 0,4(3)
blr

As you can see, the generated code is just as bad.
This is independent of the register pinning.

Can I take your comment as confirmation that forward propagation is
broken in GCC/PPC?


Kind regards

Gunnar von Boehn


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36770



[Bug c/36772] New: GCC generates impossible BRANCH instruction

2008-07-09 Thread gunnar at greyhound-data dot com
Andreas Schwab and Gunther Nikl have pointed out that GCC
knowingly creates branch instructions with impossible conditions.

Reason Summary:
- GCC is able to simplify certain compares.
- GCC seems to be unable to correctly rewrite the corresponding branch
instructions.
- GCC thereby generates branches whose conditions can by definition never
be true.

Example:

  void foo (unsigned long j)
  {
    unsigned int i;
    for (i = 0; i < (j>>5); ++i)
      ;
  }


In this example the generated code includes a compare against a value that
is known to be ZERO.

GCC correctly knows that it can simplify a compare against ZERO.
But GCC fails to adapt the condition codes accordingly.

Background:
A compare of two unsigned variables can set the CARRY flag.
A compare against ZERO can never set the CARRY flag.

While GCC recognizes that it can rewrite a CMP with 0,
GCC does not rewrite the branch that checks for the Carry.

GCC thereby creates branches with conditions that are known at compile time
to be impossible.
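
The background statement about the CARRY flag can be sketched in C. This is a
simplified model under the assumption that an unsigned "cmp src,dst" evaluates
dst - src and sets Carry on borrow; the names (cmp_flags, model_cmp, cond_cs,
cond_hi, cond_ne) are hypothetical and only the C and Z bits are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two flags relevant to unsigned branches. */
typedef struct {
    bool c; /* Carry: set when dst - src needs a borrow */
    bool z; /* Zero: set when dst equals src */
} cmp_flags;

/* Flags after an unsigned "cmp src,dst", which evaluates dst - src. */
static cmp_flags model_cmp(uint32_t dst, uint32_t src)
{
    cmp_flags f;
    f.c = dst < src;
    f.z = dst == src;
    return f;
}

static bool cond_cs(cmp_flags f) { return f.c; }         /* carry set */
static bool cond_hi(cmp_flags f) { return !f.c && !f.z; } /* higher    */
static bool cond_ne(cmp_flags f) { return !f.z; }         /* not equal */

/* With src == 0 the Carry test can never succeed, and "higher"
   carries no more information than "not equal". */
static void check_compare_with_zero(uint32_t dst)
{
    cmp_flags f = model_cmp(dst, 0);
    assert(!cond_cs(f));
    assert(cond_hi(f) == cond_ne(f));
}
```

In this model a branch on "carry set" after a compare with zero is never taken,
which is the kind of impossible condition described above.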


> The following link should be the thread about this issue 
> http://gcc.gnu.org/ml/gcc/2003-10/msg01236.html


The problem that we describe here leads to the following:
GCC creates branches that include impossible conditions.
GCC does not remove the impossible conditions.
GCC wants to replace the CMP with 0 with more efficient code,
but the 68K backend needs to forbid this.
Normally it would be possible to simply omit the CMP 0,variable,
and all branches that test a condition that can actually be set by a cmp
with 0 could still be evaluated.
But as GCC emits branches that include known impossible conditions, omitting
the cmp is not possible.
The CMP with 0 is explicitly needed to ensure that all flags are cleared, so
that the branches on known impossible conditions are not taken.


It would be great if you could fix the cause of this instead of curing the
symptom,
as this would allow the backend to really remove the unneeded CMP instructions,
thereby generating smaller and faster code.

Many thanks in advance

Gunnar von Boehn


-- 
   Summary: GCC generates impossible BRANCH instruction
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: gunnar at greyhound-data dot com
 GCC build triplet: m68k-linux-gnu
  GCC host triplet: m68k-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36772



[Bug c/36770] New: PowerPC generated PTR code ineffiency

2008-07-09 Thread gunnar at greyhound-data dot com
GCC fails to generate efficient code for basic pointer operations.

Please have a look at this example:
***
test.c:
register int * src asm("r15");

int test( ){
  src[1]=src[0];
  src++;
}

main(){
}

***

Compiling the above with gcc -S -O3 test.c

produces the following ASM output:

test:
mr 9,15
addi 15,15,4
lwz 0,0(9)
stw 0,4(9)
blr

Compiling with gcc -S -Os test.c
gives this output:
test:
mr 9,15
addi 15,15,4
lwz 0,0(9)
stw 0,4(9)
blr


As you can see both -O3 and -Os produce the same output.
The generated output is far from optimal.

GCC generates for the simple pointer operation this code:
mr 9,15
addi 15,15,4
lwz 0,0(9)
stw 0,4(9)

But GCC should rather generate this:
lwz 0,0(15)
stwu 0,4(15)


Two of the four instructions are unneeded.
We have code here with literally thousands of unneeded instructions generated
like this.
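
The equivalence of the two sequences can be sketched in C. This is only a
model, assuming the usual PowerPC meaning of stwu (store at base plus
displacement, then write that effective address back to the base register);
the helper names are made up for this illustration and displacements are
counted in words rather than bytes.

```c
#include <assert.h>
#include <stdint.h>

/* stw: store at base + displacement, base register unchanged */
static void model_stw(int32_t value, int32_t *base, int disp_words)
{
    base[disp_words] = value;
}

/* stwu: store at base + displacement and return the updated base */
static int32_t *model_stwu(int32_t value, int32_t *base, int disp_words)
{
    base[disp_words] = value;
    return base + disp_words;
}

/* The four-instruction sequence from the report. */
static int32_t *copy_step_four_insns(int32_t *src)
{
    int32_t *tmp = src;      /* mr   9,15    */
    src = src + 1;           /* addi 15,15,4 */
    int32_t v = tmp[0];      /* lwz  0,0(9)  */
    model_stw(v, tmp, 1);    /* stw  0,4(9)  */
    return src;
}

/* The two-instruction sequence proposed in the report. */
static int32_t *copy_step_two_insns(int32_t *src)
{
    int32_t v = src[0];            /* lwz  0,0(15) */
    return model_stwu(v, src, 1);  /* stwu 0,4(15) */
}

/* Both variants store the same word and advance the pointer by one. */
static void check_same_effect(void)
{
    int32_t a[2] = {7, 0};
    int32_t b[2] = {7, 0};
    assert(copy_step_four_insns(a) == a + 1 && a[1] == 7);
    assert(copy_step_two_insns(b) == b + 1 && b[1] == 7);
}
```

Both helper functions leave memory and the returned pointer in the same state,
which is why the mr and addi are not needed.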


I very much hope that this information is helpful to you and that you can fix
this.

Many thanks in advance

Gunnar von Boehn


-- 
   Summary: PowerPC generated PTR code ineffiency
   Product: gcc
   Version: 4.3.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: gunnar at greyhound-data dot com
 GCC build triplet: powerpc64-unknown-linux-gnu
  GCC host triplet: powerpc64-unknown-linux-gnu
GCC target triplet: powerpc64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36770



[Bug target/36135] GCC creates suboptimal ASM : suboptimal Adressing-Modes used

2008-06-13 Thread gunnar at greyhound-data dot com


--- Comment #6 from gunnar at greyhound-data dot com  2008-06-13 13:34 
---
(In reply to comment #4)
> This comes down to IV-OPTs not understanding {post,pre}_{dec,inc} at all. 
> There is another bug about this somewhere I think for arm.  PowerPC has the
> same issue too ...
> 

Hi Andrew,

I want to make clear that the 68K backend used to be able to do this
optimization in the GCC 2.9 times. Later, with 3.4 or 4.x, this optimization
stopped working and the code became worse.
Does this make sense in your opinion?


Cheers


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36135



[Bug target/36135] GCC creates suboptimal ASM : suboptimal Adressing-Modes used

2008-06-13 Thread gunnar at greyhound-data dot com


--- Comment #5 from gunnar at greyhound-data dot com  2008-06-13 09:31 
---
(In reply to comment #4)
> This comes down to IV-OPTs not understanding {post,pre}_{dec,inc} at all. 
> There is another bug about this somewhere I think for arm.  PowerPC has the
> same issue too ...
> 

If this affects so many platforms, it sounds like an important issue to me.
Maybe someone should increase the priority and severity of the issue in this
case?

Andrew, do you plan to fix this issue?

Cheers
Gunnar


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36135



[Bug target/36135] GCC creates suboptimal ASM : suboptimal Adressing-Modes used

2008-06-12 Thread gunnar at greyhound-data dot com


--- Comment #3 from gunnar at greyhound-data dot com  2008-06-12 14:34 
---
Andreas,

What is your opinion on this?

GCC 2.9 used to combine the move with increment in the combine step to
something like this:
***
(insn 32 30 33 (set (reg/v:SI 32)
(mem:SI (post_inc:SI (reg/v:SI 34)) 0)) 42 {movsi+1} (nil)
(expr_list:REG_INC (reg/v:SI 34)
(nil)))
***


So the problem is that GCC no longer seems able to do this by itself.
With GCC 4.4 the output is:
***
(insn 34 33 35 4 example2.c:11 (set (reg/v:SI 54 [ value ])
(mem:SI (reg/v/f:SI 52 [ src ]) [2 S4 A16])) 37 {*movsi_cf} (nil))

(insn 35 34 36 4 example2.c:12 (set (reg/v:SI 53 [ value2 ])
(mem:SI (plus:SI (reg/v/f:SI 52 [ src ])
(const_int 4 [0x4])) [2 S4 A16])) 37 {*movsi_cf} (nil))

(insn 36 35 38 4 example2.c:5 (set (reg/v/f:SI 52 [ src ])
(plus:SI (reg/v/f:SI 52 [ src ])
(const_int 8 [0x8]))) 133 {*addsi3_5200} (nil))

(insn 38 36 40 4 example2.c:10 (set (reg/v:SI 50 [ size.21 ])
(plus:SI (reg/v:SI 50 [ size.21 ])
(const_int -1 [0x]))) 133 {*addsi3_5200} (nil))
***

Any ideas about this?


Kind regards

Gunnar von Boehn


-- 

gunnar at greyhound-data dot com changed:

   What|Removed |Added

 CC||schwab at suse dot de


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36135



[Bug target/36134] GCC creates suboptimal ASM : usage of ADDA.L where LEA could be used

2008-06-12 Thread gunnar at greyhound-data dot com


--- Comment #6 from gunnar at greyhound-data dot com  2008-06-12 14:27 
---
Andreas,

could you please have a look at this?

Cheers
Gunnar


-- 

gunnar at greyhound-data dot com changed:

   What|Removed |Added

 CC||schwab at suse dot de


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36134



[Bug target/36133] GCC creates suboptimal ASM : Code includes unneeded TST instructions

2008-06-12 Thread gunnar at greyhound-data dot com


--- Comment #6 from gunnar at greyhound-data dot com  2008-06-12 14:26 
---
Andreas,

Could you have a look at this?

Cheers
Gunnar


-- 

gunnar at greyhound-data dot com changed:

   What|Removed |Added

 CC||schwab at suse dot de


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36133



[Bug target/25128] [m68k] Suboptimal comparisons against 65536

2008-06-10 Thread gunnar at greyhound-data dot com


--- Comment #2 from gunnar at greyhound-data dot com  2008-06-10 16:02 
---

> Note that
> 
> cmp.l #65535,%d0
> jbhi .L10
> 
> can be replaced with
> 
> swap %d0
> tst.w %d0
> jbne .L10
> 
> A similar trick can be applied to signed comparisons as well.

But this "trick" will run slower on the higher 68k CPUs.
On the 68040, 68060 or superscalar Coldfire it is better to generate fewer
instructions, without dependencies between them.

I think "cmp.l #65535,%d0" is the code that should be generated for "O2", as it
is faster on many 68K models.
The shorter two-instruction trick might be an option for the compile option "Os".
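
For reference, the logical equivalence of the two sequences can be checked with
a small C sketch. The function names are hypothetical; the model only covers
the branch decision and ignores that swap modifies %d0 while cmp.l does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* cmp.l #65535,%d0 ; jbhi: taken when d0 is above 65535 (unsigned) */
static bool above_65535_cmp(uint32_t d0)
{
    return d0 > 65535u;
}

/* swap %d0 ; tst.w %d0 ; jbne: taken when the old upper half is non-zero */
static bool above_65535_swap(uint32_t d0)
{
    uint32_t swapped = (d0 << 16) | (d0 >> 16); /* swap %d0 */
    return (uint16_t)swapped != 0;              /* tst.w %d0 */
}

/* Both sequences must reach the same branch decision. */
static void check_equivalent(uint32_t d0)
{
    assert(above_65535_cmp(d0) == above_65535_swap(d0));
}
```

The sketch says nothing about speed; it only shows that the choice between the
two forms is a pure size versus scheduling trade-off.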


Kind regards

Gunnar von Boehn


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25128



[Bug c/36488] New: Generated 68K code bad for pipelining (case swap)

2008-06-10 Thread gunnar at greyhound-data dot com
+++ This bug was initially created as a clone of Bug #36487 +++

The code generated by GCC 4.4 (trunk) for 68K/Coldfire will run slowly on the
superscalar pipelines of the 68060 and Coldfire V4/V5 cores.


If you compile this example:

uint32_t fletcher( uint16_t *data, size_t len )
{
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (len) {
                unsigned tlen = len > 360 ? 360 : len;
                len -= tlen;
                do {
                        sum1 += *data++;
                        sum2 += sum1;
                } while (--tlen);
                sum1 = (sum1 & 0xffff) + (sum1 >> 16);
                sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return sum2 << 16 | sum1;
}




with 
"m68k-linux-gnu-gcc -mcpu=54470 -fomit-frame-pointer -O3 -S -o example.s
example.c"

Then you will see that this code is created:

1   clr.w %d3
2   swap %d3
3   clr.w %d4
4   swap %d4


Instruction 2 depends on instruction 1.
Instruction 4 depends on instruction 3.

Simply reordering the code as follows could as much as double the
performance, because superscalar designs such as the 68060 or V5 Coldfire
can then execute more instructions in parallel:

1   clr.w %d3
2   clr.w %d4
3   swap %d3
4   swap %d4


GCC does not try to reduce the instruction dependencies.
The code that GCC generates does not follow the scheduling recommendations for
the 68040/68060 and later superscalar CPUs.
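
The effect of the reordering can be illustrated with a toy model in C. This is
a deliberately crude sketch, not a description of the real 68060 or Coldfire
pairing rules: it assumes an in-order core that can issue two neighbouring
instructions in one cycle unless both touch the same register. The type insn
and the function count_cycles are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction record: each instruction here reads and
   writes exactly one data register. */
typedef struct {
    const char *text;
    int reg;
} insn;

/* Count issue cycles: two neighbouring instructions share a cycle
   only when they use different registers. */
static unsigned count_cycles(const insn *code, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;
    while (i < n) {
        cycles++;
        if (i + 1 < n && code[i + 1].reg != code[i].reg)
            i += 2; /* independent pair issues together */
        else
            i += 1; /* dependent or last instruction issues alone */
    }
    return cycles;
}

static const insn gcc_order[4] = {
    {"clr.w %d3", 3}, {"swap %d3", 3}, {"clr.w %d4", 4}, {"swap %d4", 4}
};

static const insn reordered[4] = {
    {"clr.w %d3", 3}, {"clr.w %d4", 4}, {"swap %d3", 3}, {"swap %d4", 4}
};

/* The reordered sequence never needs more cycles than the original. */
static void check_reordering_helps(void)
{
    assert(count_cycles(reordered, 4) < count_cycles(gcc_order, 4));
}
```

In this toy model the original order needs three issue cycles and the reordered
one two; the exact gain on real hardware depends on the pairing rules of the
core in question.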

Could you please be so kind as to correct this?


-- 
   Summary: Generated 68K code bad for pipelining (case swap)
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gunnar at greyhound-data dot com
  GCC host triplet: m68k-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36488



[Bug c/36487] New: Generated 68K code bad for pipelining

2008-06-10 Thread gunnar at greyhound-data dot com
The code generated by GCC 4.4 (trunk) for 68K/Coldfire will run slowly on the
superscalar pipelines of the 68060 and Coldfire V4/V5 cores.


If you compile this example:

uint32_t fletcher( uint16_t *data, size_t len )
{
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (len) {
                unsigned tlen = len > 360 ? 360 : len;
                len -= tlen;
                do {
                        sum1 += *data++;
                        sum2 += sum1;
                } while (--tlen);
                sum1 = (sum1 & 0xffff) + (sum1 >> 16);
                sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return sum2 << 16 | sum1;
}




with 
"m68k-linux-gnu-gcc -mcpu=68060 -fomit-frame-pointer -O3 -S -o example.s
example.c"

Then you will see that this definition generates the code below:
{
 uint32_t sum1 = 0xffff, sum2 = 0xffff;
}
moveq #0,%d2
not.w %d2
move.l %d2,%d3

These are THREE dependent instructions in a row.
Even with result forwarding these THREE instructions will need 3 clocks to
execute. Instead of the three instructions above, the compiler could have
generated two instructions like this:
 move.l #0xffff,%d2
 move.l #0xffff,%d3

Or the compiler could have put other independent instructions between them.

GCC does not try to reduce the instruction dependencies.
The code that GCC generates does not follow the scheduling recommendations for
the 68040/68060 and later superscalar CPUs.

Please be so kind as to correct this.


-- 
   Summary: Generated 68K code bad for pipelining
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gunnar at greyhound-data dot com
  GCC host triplet: m68k-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36487



[Bug target/36134] GCC creates suboptimal ASM : usage of ADDA.L where LEA could be used

2008-06-10 Thread gunnar at greyhound-data dot com


--- Comment #5 from gunnar at greyhound-data dot com  2008-06-10 15:24 
---
(In reply to comment #4)
> Could you please submit your patch to [EMAIL PROTECTED], including a
> ChangeLog entry and stating how you tested it.
> 

As requested, I sent the email last week.
Do you need anything else from me to work on this?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36134



[Bug target/36133] GCC creates suboptimal ASM : Code includes unneeded TST instructions

2008-06-05 Thread gunnar at greyhound-data dot com


--- Comment #5 from gunnar at greyhound-data dot com  2008-06-05 12:07 
---
Please find below a proposed patch.
The patch makes GCC aware that the shift already sets the CC
and that the TST is not needed in this case.
The same approach could be used to make GCC aware of the CC set by other
instructions.

Index: gcc/config/m68k/m68k.md
===
*** gcc/config/m68k/m68k.md.orig 2008-05-30 10:00:55.0 +0200
--- gcc/config/m68k/m68k.md 2008-06-04 17:01:11.0 +0200
***
*** 5198,5203 
--- 5198,5215 
[(set_attr "type" "shift")
 (set_attr "opy" "2")])

+ (define_insn "*lshrsi3_cc"
+   [(set (cc0)
+   (lshiftrt:SI (match_operand:SI 1 "register_operand" "0")
+(match_operand:SI 2 "general_operand" "dI")))
+(set (match_operand:SI 0 "register_operand" "=d")
+   (lshiftrt:SI (match_dup 1)
+(match_dup 2)))]
+   ""
+   "lsr%.l %2,%0"
+   [(set_attr "type" "shift")
+(set_attr "opy" "2")])
+
  (define_insn "lshrhi3"
[(set (match_operand:HI 0 "register_operand" "=d")
(lshiftrt:HI (match_operand:HI 1 "register_operand" "0")


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36133



[Bug c/36433] mregeparm not supported on 68k / Coldfire

2008-06-04 Thread gunnar at greyhound-data dot com


--- Comment #1 from gunnar at greyhound-data dot com  2008-06-04 09:54 
---
The parameter -mregparm is not supported on M68k / Coldfire.

As is known from the X86 platform, compiling with -mregparm improves the
size and performance of the generated code. On X86 an overall improvement of
5%-7% is generally stated. This parameter is unfortunately not supported for
the M68k and Coldfire platforms. This is a serious drawback, especially as on
68k there are operating systems which pass parameters in registers as their
default behavior (e.g. AmigaOS).
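
For readers who have not used the feature, this is roughly how it looks on
32-bit x86 with GCC, where the regparm function attribute is the per-function
counterpart of -mregparm. The macro REGPARM3 and the function add3 are made up
for this sketch, and the guard falls back to the default calling convention on
targets where the attribute does not apply.

```c
#include <assert.h>

/* regparm is only meaningful for GCC on 32-bit x86; elsewhere the
   hypothetical macro expands to nothing. */
#if defined(__GNUC__) && defined(__i386__)
#define REGPARM3 __attribute__((regparm(3)))
#else
#define REGPARM3
#endif

/* With regparm(3) on 32-bit x86, up to three integer arguments are
   passed in EAX, EDX and ECX instead of on the stack. */
static int REGPARM3 add3(int a, int b, int c)
{
    return a + b + c;
}

/* The result is the same either way; only the calling convention
   changes. */
static void check_add3(void)
{
    assert(add3(1, 2, 3) == 6);
}
```

The request in this report is for an equivalent mechanism on 68k / Coldfire;
nothing in the sketch implies that such an attribute exists there today.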

Please be so kind as to add the regparm feature to the 68k / Coldfire backend.
It will certainly improve generated code a lot.

Many thanks in advance

Gunnar von Boehn


-- 

gunnar at greyhound-data dot com changed:

   What|Removed |Added

Summary|mregeparm   |mregeparm not supported on
   ||68k / Coldfire


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36433



[Bug c/36433] New: mregeparm

2008-06-04 Thread gunnar at greyhound-data dot com



-- 
   Summary: mregeparm
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gunnar at greyhound-data dot com
 GCC build triplet: m68k-linux-gnu
  GCC host triplet: m68k-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36433



[Bug target/36133] GCC creates suboptimal ASM : Code includes unneeded TST instructions

2008-06-04 Thread gunnar at greyhound-data dot com


--- Comment #4 from gunnar at greyhound-data dot com  2008-06-04 09:29 
---
I want to add that this wrong behavior is partly related to the compile option
"-Os".

There are two cases where GCC generates unneeded TST instructions.
A) General arithmetic 
 lsr.l #1,D0 
 tst.l d0
 jbne ...

This tst instruction is unneeded as the LSR is setting the flags correctly
already.

B) subq.l #1,D1
  tst.l d1
  jbne ...

This unneeded TST is related to the compile option used.
If you compile the source with "-O2", this second kind of unneeded TST
instruction is not included in the generated code.

It seems to me that an important general optimization step, which used to be
in "Os" in GCC 2.9, was removed from "Os", causing GCC to generate worse code
now.

Can you please be so kind and correct this?

I believe that this issue is quite serious for the performance of the generated
code.

1st: The unneeded TST instructions increase code size, which is important
in embedded environments.

2nd:
There are cases where the instruction that actually set the condition codes
is far enough away from the conditional branch, with no CC-trashing
instruction in between, that the instruction fetcher can predict the branch
100% correctly and fold it away completely. The unneeded TST instruction makes
branch folding impossible and requires the CPU to guess the branch instead.
This causes a serious performance impact when the branch is mispredicted.
It should be clear that the unneeded TST instruction does not only bloat the
code; under the conditions mentioned above it can seriously degrade
performance as well, depending on the CPU used, of course.

In the light of this, wouldn't it make sense to increase the Severity of this
issue?


Regards
Gunnar


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36133



[Bug target/36134] GCC creates suboptimal ASM : usage of ADDA.L where LEA could be used

2008-05-29 Thread gunnar at greyhound-data dot com


--- Comment #3 from gunnar at greyhound-data dot com  2008-05-29 12:50 
---
Created an attachment (id=15699)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15699&action=view)
Prefer 4 byte long LEA over 6 byte long ADD.L

Please include the attached patch for GCC.

The attached patch changes the case statement to prefer the 4-byte lea
over the 6-byte add.l for immediate sub/add instructions on address
registers when the immediate operand fits in 16 bits.

LEA is optimized for pipelining (with destination forwarding) and is shorter
than ADD.L
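
The size rule behind the patch can be sketched as follows. This is not the
code from the attachment; fits_disp16 and pointer_add_size are hypothetical
helpers that only model the two encodings discussed here and ignore the short
addq/subq forms for small constants.

```c
#include <assert.h>
#include <stdbool.h>

/* A displacement for "lea (d16,An),An" must fit in a signed 16-bit
   extension word. */
static bool fits_disp16(long imm)
{
    return imm >= -32768 && imm <= 32767;
}

/* Encoded size in bytes of adding a constant to an address register:
   lea (d16,An),An  -> opcode word + 16-bit displacement = 4 bytes
   add.l #imm32,An  -> opcode word + 32-bit immediate    = 6 bytes */
static unsigned pointer_add_size(long imm)
{
    return fits_disp16(imm) ? 4u : 6u;
}

/* The pointer increments of 16 from the testcase fit the short form. */
static void check_testcase_increment(void)
{
    assert(pointer_add_size(16) == 4);
}
```

For the two "add.l #16" instructions in the testcase this saves two bytes each.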


Regards
Gunnar von Boehn


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36134



[Bug target/36136] GCC creates suboptimal ASM : constant work registers are set up inside work loops and not outside of the loop

2008-05-28 Thread gunnar at greyhound-data dot com


--- Comment #2 from gunnar at greyhound-data dot com  2008-05-28 16:28 
---
(In reply to comment #1)
> It would have been nice to check at least gcc 4.3 (or better current trunk).
> 

I have verified this with the most current GCC source trunk
(GCC 4.4 code snapshot 2008-05-23).

The problem still persists.
GCC sets up its work registers inside the work loop.



write_32x4:
link.w %fp,#0
move.l 16(%fp),%d0
move.l 8(%fp),%a0
lsr.l #4,%d0
jra .L50
.L51:
moveq #1,%d1
move.l %d1,(%a0)
move.l %d1,4(%a0)
move.l %d1,8(%a0)
move.l %d1,12(%a0)
lea (16,%a0),%a0
subq.l #1,%d0
.L50:
tst.l %d0
jne .L51
unlk %fp
rts


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36136



[Bug target/36135] GCC creates suboptimal ASM : suboptimal Adressing-Modes used

2008-05-28 Thread gunnar at greyhound-data dot com


--- Comment #2 from gunnar at greyhound-data dot com  2008-05-28 16:23 
---
(In reply to comment #1)
> It would have been nice to check at least gcc 4.3 (or better current trunk).
> 

I have verified this for you with the most current GCC source.
Verified with gcc version 4.4.0 20080523 (experimental) (GCC) 

The problem that GCC uses bad addressing modes is still present.

Code generated by GCC 4.4 
copy_32x4:
link.w %fp,#-12
movem.l #3076,(%sp)
move.l 16(%fp),%d2
lsr.l #4,%d2
move.l 8(%fp),%a3
move.l 12(%fp),%a2
jra .L6
.L7:
move.l (%a2),%a1
subq.l #1,%d2
move.l 4(%a2),%d0
move.l 8(%a2),%d1
move.l 12(%a2),%a0
add.l #16,%a2
move.l %a1,(%a3)
move.l %d0,4(%a3)
move.l %d1,8(%a3)
move.l %a0,12(%a3)
add.l #16,%a3
.L6:
tst.l %d2
jne .L7
movem.l (%sp),#3076
unlk %fp
rts


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36135



[Bug target/36134] GCC creates suboptimal ASM : usage of ADDA.L where LEA could be used

2008-05-28 Thread gunnar at greyhound-data dot com


--- Comment #2 from gunnar at greyhound-data dot com  2008-05-28 16:18 
---
(In reply to comment #1)
> It would have been nice to check at least gcc 4.3 (or better current trunk).
> 

I've verified this with the latest gcc source, "version 4.4.0 20080523
(experimental) (GCC)".

The most current GCC source still has the problem
that ADD.L instructions are used for incrementing pointers instead of the
shorter LEA instruction.


Code generated by GCC 4.4 for the testcase.

copy_32x4:
link.w %fp,#-12
movem.l #3076,(%sp)
move.l 16(%fp),%d2
lsr.l #4,%d2
move.l 8(%fp),%a3
move.l 12(%fp),%a2
jra .L6
.L7:
move.l (%a2),%a1
subq.l #1,%d2
move.l 4(%a2),%d0
move.l 8(%a2),%d1
move.l 12(%a2),%a0
add.l #16,%a2
move.l %a1,(%a3)
move.l %d0,4(%a3)
move.l %d1,8(%a3)
move.l %a0,12(%a3)
add.l #16,%a3
.L6:
tst.l %d2
jne .L7
movem.l (%sp),#3076
unlk %fp
rts



Regards
Gunnar


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36134



[Bug target/36133] GCC creates suboptimal ASM : Code includes unneeded TST instructions

2008-05-28 Thread gunnar at greyhound-data dot com


--- Comment #3 from gunnar at greyhound-data dot com  2008-05-28 16:14 
---
(In reply to comment #1)
> It would have been nice to check at least gcc 4.3 (or better current trunk).
> 

I've verified this with the latest gcc source, "version 4.4.0 20080523
(experimental) (GCC)".
The problem that GCC uses totally unneeded TST instructions is still in the
current source.

Regards
Gunnar


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36133



[Bug c/36136] New: GCC creates suboptimal ASM : constant work registers are set up inside work loops and not outside of the loop

2008-05-05 Thread gunnar at greyhound-data dot com
+++ This bug was initially created as a clone of Bug #36133 +++

Hello,

The ASM code created by GCC 4.2.1 for the 68k/Coldfire platform is not optimal.
Comparing the ASM output of GCC 2.9 with that of GCC 4.2,
the generated code has in places become much worse with GCC 4.


One problem that showed up a lot is that GCC sets up constant work registers
inside working loops and not outside of them.
At address (1c), the instruction moveq #1,%d1 that sets up the work register is
inside the working loop and is needlessly executed on every iteration.

Second problem:
At address (16), the instruction movel #1,%a0@ uses the literal value #1 and not
the work register that holds the same value. The literal move.l #1 has a length
of 6 bytes, while using the work register would take only 2 bytes.


Example: C-source
Code:
void * write_32x4(void *destparam, const void *srcparam, size_t size)
{
        int  value=1;
        int *dst = destparam;
        size = size / 16;
        for (; size; size--) {
                *dst++=value;
                *dst++=value;
                *dst++=value;
                *dst++=value;
        }
}
Compile option: m68k-linux-gnu-gcc -mcpu=54455 -msoft-float -o example -Os
-fomit-frame-pointer example.c

Code generated by GCC 4.2:
:
0a:   202f 000c       movel %sp@(12),%d0
0e:   206f 0004       moveal %sp@(4),%a0
12:   e888            lsrl #4,%d0
14:   601c            bras 32
16:   20bc 0000 0001  movel #1,%a0@
1c:   7201            moveq #1,%d1
1e:   2141 0004       movel %d1,%a0@(4)
22:   2141 0008       movel %d1,%a0@(8)
26:   2141 000c       movel %d1,%a0@(12)
2a:   d1fc 0000 0010  addal #16,%a0
30:   5380            subql #1,%d0
32:   4a80            tstl %d0
34:   66e0            bnes 16
36:   4e75            rts
Generated code length = 46 Byte
Length of Workloop: 9 instructions, 32 byte 


For comparison here is code that you would expect:
0a:   202f 000c       movel %sp@(12),%d0
0e:   206f 0004       moveal %sp@(4),%a0
12:   7201            moveq #1,%d1
14:   e888            lsrl #4,%d0
16:   601c            beqs 24
18:   21c0            movel %d1,%a0@+
1a:   21c0            movel %d1,%a0@+
1c:   21c0            movel %d1,%a0@+
1e:   21c0            movel %d1,%a0@+
20:   5380            subql #1,%d0
22:   66e0            bnes 18
24:   4e75            rts
Expected code length = 28 Byte
Length of Workloop: 6 instructions, 12 byte 
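
The loop sizes quoted above follow directly from the instruction lengths in the
two listings. A small C sketch of the arithmetic, with the per-instruction byte
counts read off the address column (the array and function names are made up
for this illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Sum the encoded lengths of a sequence of instructions. */
static unsigned total_bytes(const unsigned *lengths, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += lengths[i];
    return sum;
}

/* GCC 4.2 loop body, addresses 16..35:
   movel #1, moveq, 3 x movel d16, addal #16, subql, tstl, bnes */
static const unsigned gcc_loop[] = {6, 2, 4, 4, 4, 6, 2, 2, 2};

/* Expected loop body, addresses 18..23:
   4 x movel with post-increment, subql, bnes */
static const unsigned expected_loop[] = {2, 2, 2, 2, 2, 2};

/* Cross-check the figures given in the report. */
static void check_loop_sizes(void)
{
    assert(total_bytes(gcc_loop, sizeof gcc_loop / sizeof gcc_loop[0]) == 32);
    assert(total_bytes(expected_loop,
                       sizeof expected_loop / sizeof expected_loop[0]) == 12);
}
```

The counts match the "9 instructions, 32 byte" and "6 instructions, 12 byte"
figures given for the two work loops.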


Compiler used:
m68k-linux-gnu-gcc -v
Using built-in specs.
Target: m68k-linux-gnu
Configured with: /scratch/shinwell/cf-fall-linux-lite/src/gcc-4.2/configure
--build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=m68k-linux-gnu
--enable-threads --disable-libmudflap --disable-libssp --disable-libgomp
--disable-libstdcxx-pch --with-arch=cf --with-gnu-as --with-gnu-ld
--enable-languages=c,c++ --enable-shared --enable-symvers=gnu
--enable-__cxa_atexit --with-pkgversion=Sourcery G++ Lite 4.2-47
--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls
--prefix=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux
--with-sysroot=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux/m68k-linux-gnu/libc
--with-build-sysroot=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/libc
--enable-poison-system-directories
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
Thread model: posix
gcc version 4.2.1 (Sourcery G++ Lite 4.2-47)


I hope that this report helps you to improve the quality of GCC.

Kind regards

Gunnar von Boehn
--
P.S. I put the noticed issues in individual tickets for easier tracking. I hope
that this is helpful to you.


-- 
   Summary: GCC creates suboptimal ASM : constant work registers are
set up inside work loops and not outside of the loop
   Product: gcc
   Version: 4.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gunnar at greyhound-data dot com
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36136



[Bug c/36135] New: GCC creates suboptimal ASM : suboptimal Adressing-Modes used

2008-05-05 Thread gunnar at greyhound-data dot com
+++ This bug was initially created as a clone of Bug #36133 +++

Hello,

The ASM code created by GCC 4.2.1 for the 68k/Coldfire platform is not optimal.
Comparing the ASM output of GCC 2.9 with that of GCC 4.2,
the generated code has in places become much worse with GCC 4.


One problem that showed up a lot is that GCC uses suboptimal addressing
modes.

Please see the example below for details.
At addresses 14 to 2E this code was generated:
14:   2290            movel %a0@,%a1@
16:   2368 0004 0004  movel %a0@(4),%a1@(4)
1c:   2368 0008 0008  movel %a0@(8),%a1@(8)
22:   2368 000c 000c  movel %a0@(12),%a1@(12)
28:   d3fc 0000 0010  addal #16,%a1
2e:   d1fc 0000 0010  addal #16,%a0

Much shorter and more efficient would have been this:
14:   20d9            movel %a0@+,%a1@+
16:   20d9            movel %a0@+,%a1@+
18:   20d9            movel %a0@+,%a1@+
1a:   20d9            movel %a0@+,%a1@+


Example: C-source
Code:
void * copy_32x4a(void *destparam, const void *srcparam, size_t size)
{
        int *dest = destparam;
        const int *src = srcparam;
        int size32;
        size32 = size / 16;
        for (; size32; size32--) {
                *dest++ = *src++;
                *dest++ = *src++;
                *dest++ = *src++;
                *dest++ = *src++;
        }
}

Compile option: m68k-linux-gnu-gcc -mcpu=54455 -msoft-float -o example -Os
-fomit-frame-pointer example.c

Code generated by GCC 4.2:
04:   202f 000c       movel %sp@(12),%d0
08:   226f 0004       moveal %sp@(4),%a1
0c:   206f 0008       moveal %sp@(8),%a0
10:   e888            lsrl #4,%d0
12:   6022            bras 36
14:   2290            movel %a0@,%a1@
16:   2368 0004 0004  movel %a0@(4),%a1@(4)
1c:   2368 0008 0008  movel %a0@(8),%a1@(8)
22:   2368 000c 000c  movel %a0@(12),%a1@(12)
28:   d3fc 0000 0010  addal #16,%a1
2e:   d1fc 0000 0010  addal #16,%a0
34:   5380            subql #1,%d0
36:   4a80            tstl %d0
38:   66da            bnes 14
3a:   4e75            rts

For comparison here is code that you would expect:
04:   202f 000c       movel %sp@(12),%d0
08:   226f 0004       moveal %sp@(4),%a1
0c:   206f 0008       moveal %sp@(8),%a0
10:   e888            lsrl #4,%d0
12:   6022            beq 20
14:   20d9            movel %a0@+,%a1@+
16:   20d9            movel %a0@+,%a1@+
18:   20d9            movel %a0@+,%a1@+
1a:   20d9            movel %a0@+,%a1@+
1c:   5380            subql #1,%d0
1e:   66da            bnes 14
20:   4e75            rts

Compiler used:
m68k-linux-gnu-gcc -v
Using built-in specs.
Target: m68k-linux-gnu
Configured with: /scratch/shinwell/cf-fall-linux-lite/src/gcc-4.2/configure
--build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=m68k-linux-gnu
--enable-threads --disable-libmudflap --disable-libssp --disable-libgomp
--disable-libstdcxx-pch --with-arch=cf --with-gnu-as --with-gnu-ld
--enable-languages=c,c++ --enable-shared --enable-symvers=gnu
--enable-__cxa_atexit --with-pkgversion=Sourcery G++ Lite 4.2-47
--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls
--prefix=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux
--with-sysroot=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux/m68k-linux-gnu/libc
--with-build-sysroot=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/libc
--enable-poison-system-directories
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
Thread model: posix
gcc version 4.2.1 (Sourcery G++ Lite 4.2-47)


I hope that this report helps you to improve the quality of GCC.

Kind regards

Gunnar von Boehn
--
P.S. I put the noticed issues in individual tickets for easier tracking. I hope
that this is helpful to you.


-- 
   Summary: GCC creates suboptimal ASM : suboptimal Adressing-Modes
used
   Product: gcc
   Version: 4.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gunnar at greyhound-data dot com
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36135



[Bug c/36134] New: GCC creates suboptimal ASM : usage of ADDA.L where LEA could be used

2008-05-05 Thread gunnar at greyhound-data dot com
+++ This bug was initially created as a clone of Bug #36133 +++

Hello,

The ASM code that GCC 4.2.1 generates for the 68k/ColdFire platform is not
optimal. Compared with the ASM output of GCC 2.9, the code generated by
GCC 4.2 is in places much worse.


One problem that showed up frequently is that GCC uses ADDA.L where the
shorter LEA instruction would do.

Please see the example below for details.
At addresses 28 and 2e, ADDA.L is used twice where the shorter LEA
instruction could have been used instead.
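
To make the size difference concrete, here is a minimal C sketch that builds
the machine words for both forms. The helper names are hypothetical and the
encodings are the standard 68k ones as I understand them (the opcode word
d3fc for "addal #16,%a1" matches the listing in this report); please verify
against the ColdFire reference manual before relying on it. It assumes the
increment fits into a signed 16-bit displacement.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode "adda.l #imm,%aN" (N = an, 0..7).
 * One opcode word plus a 32-bit immediate: 3 words = 6 bytes. */
size_t encode_adda_l_imm(uint16_t buf[3], unsigned an, uint32_t imm)
{
    buf[0] = (uint16_t)(0xD1FCu | (an << 9)); /* d3fc for %a1, d1fc for %a0 */
    buf[1] = (uint16_t)(imm >> 16);           /* immediate, high word */
    buf[2] = (uint16_t)(imm & 0xFFFFu);       /* immediate, low word */
    return 3 * sizeof buf[0];
}

/* Hypothetical helper: encode "lea %aN@(disp),%aN" (N = an, 0..7).
 * One opcode word plus a 16-bit displacement: 2 words = 4 bytes. */
size_t encode_lea_disp16(uint16_t buf[2], unsigned an, int16_t disp)
{
    buf[0] = (uint16_t)(0x41E8u | (an << 9) | an); /* 43e9 for %a1 */
    buf[1] = (uint16_t)disp;                       /* signed displacement */
    return 2 * sizeof buf[0];
}
```

For an increment of 16 on %a1 this gives "d3fc 0000 0010" (6 bytes) for
ADDA.L and "43e9 0010" (4 bytes) for LEA, so each replaced instruction would
save 2 bytes. Neither instruction changes the condition codes.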



Example: C-source
Code:
void * copy_32x4a(void *destparam, const void *srcparam, size_t size)
{
    int *dest = destparam;
    const int *src = srcparam;
    int size32;
    size32 = size / 16;
    for (; size32; size32--) {
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
    }
}

Compile option: m68k-linux-gnu-gcc -mcpu=54455 -msoft-float -o example -Os
-fomit-frame-pointer example.c

Code generated by GCC 4.2:
04:   202f 000c       movel %sp@(12),%d0
08:   226f 0004       moveal %sp@(4),%a1
0c:   206f 0008       moveal %sp@(8),%a0
10:   e888            lsrl #4,%d0
12:   6022            bras 36
14:   2290            movel %a0@,%a1@
16:   2368 0004 0004  movel %a0@(4),%a1@(4)
1c:   2368 0008 0008  movel %a0@(8),%a1@(8)
22:   2368 000c 000c  movel %a0@(12),%a1@(12)
28:   d3fc 0000 0010  addal #16,%a1
2e:   d1fc 0000 0010  addal #16,%a0
34:   5380            subql #1,%d0
36:   4a80            tstl %d0
38:   66da            bnes 14
3a:   4e75            rts

For comparison here is code that you would expect:
04:   202f 000c       movel %sp@(12),%d0
08:   226f 0004       moveal %sp@(4),%a1
0c:   206f 0008       moveal %sp@(8),%a0
10:   e888            lsrl #4,%d0
12:   6022            beq 20
14:   20d9            movel %a1@+,%a0@+
16:   20d9            movel %a1@+,%a0@+
18:   20d9            movel %a1@+,%a0@+
1a:   20d9            movel %a1@+,%a0@+
1c:   5380            subql #1,%d0
1e:   66da            bnes 14
20:   4e75            rts

Compiler used:
m68k-linux-gnu-gcc -v
Using built-in specs.
Target: m68k-linux-gnu
Configured with: /scratch/shinwell/cf-fall-linux-lite/src/gcc-4.2/configure
--build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=m68k-linux-gnu
--enable-threads --disable-libmudflap --disable-libssp --disable-libgomp
--disable-libstdcxx-pch --with-arch=cf --with-gnu-as --with-gnu-ld
--enable-languages=c,c++ --enable-shared --enable-symvers=gnu
--enable-__cxa_atexit --with-pkgversion=Sourcery G++ Lite 4.2-47
--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls
--prefix=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux
--with-sysroot=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux/m68k-linux-gnu/libc
--with-build-sysroot=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/libc
--enable-poison-system-directories
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
Thread model: posix
gcc version 4.2.1 (Sourcery G++ Lite 4.2-47)


I hope that this report helps you to improve the quality of GCC.

Kind regards

Gunnar von Boehn
--
P.S. I put the noticed issues in individual tickets for easier tracking. I hope
that this is helpful.


-- 
   Summary: GCC creates suboptimal ASM : usage of ADDA.L where LEA could be used
   Product: gcc
   Version: 4.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gunnar at greyhound-data dot com
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36134



[Bug c/36133] New: GCC creates suboptimal ASM : Code includes unneeded TST instructions

2008-05-05 Thread gunnar at greyhound-data dot com
Hello,

The ASM code that GCC 4.2.1 generates for the 68k/ColdFire platform is not
optimal. Compared with the ASM output of GCC 2.9, the code generated by
GCC 4.2 is in places much worse.


One problem that showed up in very many places is
that GCC generates unnecessary TST instructions.

Please see the example below for details.
The TST.L instruction at address 36 is inside the loop and is unneeded.
The lsrl at address 10 and the subql #1,%d0 at address 34 both already set the
condition codes, so there is no need for an extra TST instruction at
all.
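
The argument can be sketched as a small C model of the Z flag. This is only
an illustration under stated assumptions: the function names are
hypothetical, only the Z bit is modelled (the other condition codes are
ignored), and the flag behaviour is the one I understand from the 68k
documentation, namely that LSR, SUBQ and TST each set Z exactly when their
32-bit result is zero, and that BNE branches when Z is clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* Z after "lsr.l #count,Dn" (immediate count 1..8): set iff result is zero. */
bool z_after_lsr_l(uint32_t *dn, unsigned count)
{
    *dn >>= count;
    return *dn == 0;
}

/* Z after "subq.l #imm,Dn" (imm 1..8): set iff result is zero. */
bool z_after_subq_l(uint32_t *dn, unsigned imm)
{
    *dn -= imm;
    return *dn == 0;
}

/* Z after "tst.l Dn": set iff the register is zero; Dn is unchanged. */
bool z_after_tst_l(uint32_t dn)
{
    return dn == 0;
}

/* "bne" is taken iff Z is clear. */
bool bne_taken(bool z)
{
    return !z;
}
```

In this model, z_after_tst_l(d) called directly after z_after_lsr_l(&d, n)
or z_after_subq_l(&d, n) always returns the value the previous call already
produced, so a BNE that follows sees the same Z bit with or without the TST.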



Example: C-source
Code:
void * copy_32x4a(void *destparam, const void *srcparam, size_t size)
{
    int *dest = destparam;
    const int *src = srcparam;
    int size32;
    size32 = size / 16;
    for (; size32; size32--) {
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
    }
}

Compile option: m68k-linux-gnu-gcc -mcpu=54455 -msoft-float -o example -Os
-fomit-frame-pointer example.c

Code generated by GCC 4.2:
04:   202f 000c       movel %sp@(12),%d0
08:   226f 0004       moveal %sp@(4),%a1
0c:   206f 0008       moveal %sp@(8),%a0
10:   e888            lsrl #4,%d0
12:   6022            bras 36
14:   2290            movel %a0@,%a1@
16:   2368 0004 0004  movel %a0@(4),%a1@(4)
1c:   2368 0008 0008  movel %a0@(8),%a1@(8)
22:   2368 000c 000c  movel %a0@(12),%a1@(12)
28:   d3fc 0000 0010  addal #16,%a1
2e:   d1fc 0000 0010  addal #16,%a0
34:   5380            subql #1,%d0
36:   4a80            tstl %d0
38:   66da            bnes 14
3a:   4e75            rts

For comparison here is code that you would expect:
04:   202f 000c       movel %sp@(12),%d0
08:   226f 0004       moveal %sp@(4),%a1
0c:   206f 0008       moveal %sp@(8),%a0
10:   e888            lsrl #4,%d0
12:   6022            beq 20
14:   20d9            movel %a1@+,%a0@+
16:   20d9            movel %a1@+,%a0@+
18:   20d9            movel %a1@+,%a0@+
1a:   20d9            movel %a1@+,%a0@+
1c:   5380            subql #1,%d0
1e:   66da            bnes 14
20:   4e75            rts

Compiler used:
m68k-linux-gnu-gcc -v
Using built-in specs.
Target: m68k-linux-gnu
Configured with: /scratch/shinwell/cf-fall-linux-lite/src/gcc-4.2/configure
--build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=m68k-linux-gnu
--enable-threads --disable-libmudflap --disable-libssp --disable-libgomp
--disable-libstdcxx-pch --with-arch=cf --with-gnu-as --with-gnu-ld
--enable-languages=c,c++ --enable-shared --enable-symvers=gnu
--enable-__cxa_atexit --with-pkgversion=Sourcery G++ Lite 4.2-47
--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls
--prefix=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux
--with-sysroot=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux/m68k-linux-gnu/libc
--with-build-sysroot=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/libc
--enable-poison-system-directories
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
--with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
Thread model: posix
gcc version 4.2.1 (Sourcery G++ Lite 4.2-47)


I hope that this report helps you to improve the quality of GCC.

Kind regards

Gunnar von Boehn


-- 
   Summary: GCC creates suboptimal ASM : Code includes unneeded TST instructions
   Product: gcc
   Version: 4.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gunnar at greyhound-data dot com
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: m68k-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36133