[Bug middle-end/27016] ARM optimizer produces severely suboptimal code

2010-01-01 Thread steven at gcc dot gnu dot org


--- Comment #6 from steven at gcc dot gnu dot org  2010-01-02 01:05 ---
Comment #3 and comment #5 still true with GCC: (GNU) 4.5.0 20090808
(experimental) [trunk revision 150579] - reconfirmed.


-- 

steven at gcc dot gnu dot org changed:

   What|Removed |Added

   Last reconfirmed|2009-06-30 11:35:26 |2010-01-02 01:05:52
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016



[Bug middle-end/27016] ARM optimizer produces severely suboptimal code

2009-06-30 Thread steven at gcc dot gnu dot org


--- Comment #3 from steven at gcc dot gnu dot org  2009-06-30 11:35 ---
For this test case:

unsigned int code_in_ram[100];

void testme(void)
{
  unsigned int *p_rom, *p_ram, *p_end, len;

  extern unsigned int _ram_erase_sector_start;
  extern unsigned int _ram_erase_sector_end;

  p_ram = code_in_ram;
  p_rom = _ram_erase_sector_start;
  len = ((unsigned int)_ram_erase_sector_end
 - (unsigned int)_ram_erase_sector_start) / sizeof(unsigned int);

  for (len = 100; len  0; --len)
{
  *p_ram++ = *p_rom++;
}
}


I get the following code with a 4.5.0 checkout from a few days ago (see
.ident):

.cpu arm7tdmi
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 4
.eabi_attribute 18, 4
.file   t.c
.text
.align  2
.global testme
.type   testme, %function
testme:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r1, .L5
ldr r2, .L5+4
mov r3, #0
.L2:
ldr r0, [r2, r3]
str r0, [r1, r3]
add r3, r3, #4
cmp r3, #400
bne .L2
bx  lr
.L6:
.align  2
.L5:
.word   code_in_ram
.word   _ram_erase_sector_start
.size   testme, .-testme
.comm   code_in_ram,400,4
.ident  GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision
148960]


This looks marginally better than the code of comment #2.



With the original test case of comment #0 I get the following code:

.cpu arm7tdmi
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 4
.eabi_attribute 18, 4
.file   t.c
.text
.align  2
.global testme
.type   testme, %function
testme:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r2, .L5
ldr r3, .L5+4
ldr r1, .L5+8
b   .L2
.L3:
ldr r0, [r3], #4
str r0, [r2, #-4]
.L2:
cmp r3, r1
add r2, r2, #4
bcc .L3
bx  lr
.L6:
.align  2
.L5:
.word   code_in_ram
.word   _ram_erase_sector_start
.word   _ram_erase_sector_end
.size   testme, .-testme
.comm   code_in_ram,400,4
.ident  GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision
148960]


The ldr r3, .L6+8 line of comment #0 is now hoisted from the loop into r1
in ldr r1, .L5+8.  GCC still does not use a post-increment for the store
in the loop (see use of r2), even though it uses a post-increment for the load.

So, bug still is there.  Might be a dup of one of those many other bugs about
poor use of auto-increment instructions by GCC... ;-)


-- 

steven at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever Confirmed|0   |1
   Last reconfirmed|-00-00 00:00:00 |2009-06-30 11:35:26
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016



[Bug middle-end/27016] ARM optimizer produces severely suboptimal code

2009-06-30 Thread steven at gcc dot gnu dot org


--- Comment #4 from steven at gcc dot gnu dot org  2009-06-30 13:21 ---
The auto-inc-dec pass fails because the store and the reg increment are not in
the same basic block.  The dump of the pass before auto-inc-dec (reginfo) looks
like this:

;; Function testme (testme)

   74 NOTE_INSN_BASIC_BLOCK
   72 NOTE_INSN_FUNCTION_BEG
   76 r205:SI=`code_in_ram'
   73 r203:SI=`_ram_erase_sector_start'
   87 r208:SI=`_ram_erase_sector_end'
L86:
   79 NOTE_INSN_BASIC_BLOCK
   80 r206:SI=[r203:SI]
   81 [r205:SI-0x4]=r206:SI
  REG_DEAD: r206:SI
   82 r203:SI=r203:SI+0x4
L83:
   84 NOTE_INSN_BASIC_BLOCK
   85 r205:SI=r205:SI+0x4
   88 cc:CC=cmp(r203:SI,r208:SI)
  REG_EQUAL: cmp(r203:SI,`_ram_erase_sector_end')
   89 pc={(ltu(cc:CC,0x0))?L86:pc}
  REG_DEAD: cc:CC
  REG_BR_PROB: 0x238c
   92 NOTE_INSN_BASIC_BLOCK


Then auto-inc-dec comes along and notices the opportunity to merge the load and
increment in insns 80 and 82:

starting bb 3
   82 r203:SI=r203:SI+0x4
   81 [r205:SI-0x4]=r206:SI
  REG_DEAD: r206:SI
   81 [r205:SI-0x4]=r206:SI
  REG_DEAD: r206:SI
found mem(81) *(r[205]+-4)
   80 r206:SI=[r203:SI]
   80 r206:SI=[r203:SI]
found mem(80) *(r[203]+0)
   82 r203:SI=r203:SI+0x4
found post inc(82) r[203]+=4
trying SIMPLE_POST_INC
rescanning insn with uid = 80.
deleting insn with uid = 80.
deleting insn with uid = 82.
success80 r206:SI=[r203:SI++]
  REG_INC: r203:SI


Merging the store and increment of insns 81 and 85 is never tried because the
insns are not in the same basic block.

Bernd Schmidt has patches that probably address this case of insns in different
basic blocks.  I will give his patches a try to see if it helps for ARM (the
patches were written for Blackfin).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016



[Bug middle-end/27016] ARM optimizer produces severely suboptimal code

2009-06-30 Thread steven at gcc dot gnu dot org


--- Comment #5 from steven at gcc dot gnu dot org  2009-06-30 13:27 ---
Compiling with ./cc1 -Os t.c -fno-ivopts I get the following code:

.global testme
.type   testme, %function
testme:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r2, .L5
ldr r3, .L5+4
ldr r1, .L5+8
b   .L2
.L3:
ldr r0, [r3], #4
str r0, [r2], #4
.L2:
cmp r3, r1
bcc .L3
bx  lr


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016



[Bug middle-end/27016] ARM optimizer produces severely suboptimal code

2006-04-04 Thread pinskia at gcc dot gnu dot org


--- Comment #1 from pinskia at gcc dot gnu dot org  2006-04-04 08:16 ---
This code is undefined:
  len = ((unsigned int)_ram_erase_sector_end 
 - (unsigned int)_ram_erase_sector_start) / sizeof(unsigned int);

That is obviously undefined as taking the difference between two pointers which
are not in the same array is undefined code.

Even the comparision:
p_rom  p_end;
is undefined.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016



[Bug middle-end/27016] ARM optimizer produces severely suboptimal code

2006-04-04 Thread Eric dot Doenges at betty-tv dot com


--- Comment #2 from Eric dot Doenges at betty-tv dot com  2006-04-04 09:13 
---
(In reply to comment #1)
 This code is undefined:
   len = ((unsigned int)_ram_erase_sector_end 
  - (unsigned int)_ram_erase_sector_start) / sizeof(unsigned int);
 
 That is obviously undefined as taking the difference between two pointers 
 which
 are not in the same array is undefined code.
 
 Even the comparision:
 p_rom  p_end;
 is undefined.
 
In the code I took this snippet from, _ram_erase_sector_start and
_ram_erase_sector_end are symbols
generated by the linker at the start and the end of a special segment which I
need to copy to ram,
so I would argue that these pointers do in fact refer to the same array (in
this case, the array is the
entire flash memory). 

However, none of this should affect the decision to use (or not to use) the
post-indexed addressing
mode. If I replace the for loop with a for (len = 100; len  0; --len), the
quality of the generated code
actually degrades even further:

ldr r2, .L7
ldr r1, .L7+4
@ lr needed for prologue
.L2:
ldr r3, [r1, #-4]
str r3, [r2, #-4]
ldr r3, .L7+8
add r2, r2, #4
cmp r2, r3
add r1, r1, #4
bne .L2
bx  lr

While I thinks it's nifty that gcc recognizes that it doesn't need to keep the
len variable, but instead
uses p_ram to determine when the loop is finished, I also think it's pretty
brain-dead that it won't
use post-indexed addressing for either the ldr or str in the loop. And why it
thinks it needs to load
the constant end address to compare against every time inside the loop instead
of once into a scratch
register outside the loop is anyone's guess.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016