[Bug middle-end/27016] ARM optimizer produces severely suboptimal code
--- Comment #6 from steven at gcc dot gnu dot org 2010-01-02 01:05 --- Comment #3 and comment #5 still true with GCC: (GNU) 4.5.0 20090808 (experimental) [trunk revision 150579] - reconfirmed. -- steven at gcc dot gnu dot org changed: What|Removed |Added Last reconfirmed|2009-06-30 11:35:26 |2010-01-02 01:05:52 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
[Bug middle-end/27016] ARM optimizer produces severely suboptimal code
--- Comment #3 from steven at gcc dot gnu dot org 2009-06-30 11:35 --- For this test case: unsigned int code_in_ram[100]; void testme(void) { unsigned int *p_rom, *p_ram, *p_end, len; extern unsigned int _ram_erase_sector_start; extern unsigned int _ram_erase_sector_end; p_ram = code_in_ram; p_rom = _ram_erase_sector_start; len = ((unsigned int)_ram_erase_sector_end - (unsigned int)_ram_erase_sector_start) / sizeof(unsigned int); for (len = 100; len 0; --len) { *p_ram++ = *p_rom++; } } I get the following code with a 4.5.0 checkout from a few days ago (see .ident): .cpu arm7tdmi .fpu softvfp .eabi_attribute 20, 1 .eabi_attribute 21, 1 .eabi_attribute 23, 3 .eabi_attribute 24, 1 .eabi_attribute 25, 1 .eabi_attribute 26, 1 .eabi_attribute 30, 4 .eabi_attribute 18, 4 .file t.c .text .align 2 .global testme .type testme, %function testme: @ Function supports interworking. @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r1, .L5 ldr r2, .L5+4 mov r3, #0 .L2: ldr r0, [r2, r3] str r0, [r1, r3] add r3, r3, #4 cmp r3, #400 bne .L2 bx lr .L6: .align 2 .L5: .word code_in_ram .word _ram_erase_sector_start .size testme, .-testme .comm code_in_ram,400,4 .ident GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision 148960] This looks marginally better than the code of comment #2. With the original test case of comment #0 I get the following code: .cpu arm7tdmi .fpu softvfp .eabi_attribute 20, 1 .eabi_attribute 21, 1 .eabi_attribute 23, 3 .eabi_attribute 24, 1 .eabi_attribute 25, 1 .eabi_attribute 26, 1 .eabi_attribute 30, 4 .eabi_attribute 18, 4 .file t.c .text .align 2 .global testme .type testme, %function testme: @ Function supports interworking. @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r2, .L5 ldr r3, .L5+4 ldr r1, .L5+8 b .L2 .L3: ldr r0, [r3], #4 str r0, [r2, #-4] .L2: cmp r3, r1 add r2, r2, #4 bcc .L3 bx lr .L6: .align 2 .L5: .word code_in_ram .word _ram_erase_sector_start .word _ram_erase_sector_end .size testme, .-testme .comm code_in_ram,400,4 .ident GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision 148960] The ldr r3, .L6+8 line of comment #0 is now hoisted from the loop into r1 in ldr r1, .L5+8. GCC still does not use a post-increment for the store in the loop (see use of r2), even though it uses a post-increment for the load. So, bug still is there. Might be a dup of one of those many other bugs about poor use of auto-increment instructions by GCC... ;-) -- steven at gcc dot gnu dot org changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever Confirmed|0 |1 Last reconfirmed|-00-00 00:00:00 |2009-06-30 11:35:26 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
[Bug middle-end/27016] ARM optimizer produces severely suboptimal code
--- Comment #4 from steven at gcc dot gnu dot org 2009-06-30 13:21 --- The auto-inc-dec pass fails because the store and the reg increment are not in the same basic block. The dump of the pass before auto-inc-dec (reginfo) looks like this: ;; Function testme (testme) 74 NOTE_INSN_BASIC_BLOCK 72 NOTE_INSN_FUNCTION_BEG 76 r205:SI=`code_in_ram' 73 r203:SI=`_ram_erase_sector_start' 87 r208:SI=`_ram_erase_sector_end' L86: 79 NOTE_INSN_BASIC_BLOCK 80 r206:SI=[r203:SI] 81 [r205:SI-0x4]=r206:SI REG_DEAD: r206:SI 82 r203:SI=r203:SI+0x4 L83: 84 NOTE_INSN_BASIC_BLOCK 85 r205:SI=r205:SI+0x4 88 cc:CC=cmp(r203:SI,r208:SI) REG_EQUAL: cmp(r203:SI,`_ram_erase_sector_end') 89 pc={(ltu(cc:CC,0x0))?L86:pc} REG_DEAD: cc:CC REG_BR_PROB: 0x238c 92 NOTE_INSN_BASIC_BLOCK Then auto-inc-dec comes along and notices the opportunity to merge the load and increment in insns 80 and 82: starting bb 3 82 r203:SI=r203:SI+0x4 81 [r205:SI-0x4]=r206:SI REG_DEAD: r206:SI 81 [r205:SI-0x4]=r206:SI REG_DEAD: r206:SI found mem(81) *(r[205]+-4) 80 r206:SI=[r203:SI] 80 r206:SI=[r203:SI] found mem(80) *(r[203]+0) 82 r203:SI=r203:SI+0x4 found post inc(82) r[203]+=4 trying SIMPLE_POST_INC rescanning insn with uid = 80. deleting insn with uid = 80. deleting insn with uid = 82. success80 r206:SI=[r203:SI++] REG_INC: r203:SI Merging the store and increment of insns 81 and 85 is never tried because the insns are not in the same basic block. Bernd Schmidt has patches that probably address this case of insns in different basic blocks. I will give his patches a try to see if it helps for ARM (the patches were written for Blackfin). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
[Bug middle-end/27016] ARM optimizer produces severely suboptimal code
--- Comment #5 from steven at gcc dot gnu dot org 2009-06-30 13:27 --- Compiling with ./cc1 -Os t.c -fno-ivopts I get the following code: .global testme .type testme, %function testme: @ Function supports interworking. @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r2, .L5 ldr r3, .L5+4 ldr r1, .L5+8 b .L2 .L3: ldr r0, [r3], #4 str r0, [r2], #4 .L2: cmp r3, r1 bcc .L3 bx lr -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
[Bug middle-end/27016] ARM optimizer produces severely suboptimal code
--- Comment #1 from pinskia at gcc dot gnu dot org 2006-04-04 08:16 --- This code is undefined: len = ((unsigned int)_ram_erase_sector_end - (unsigned int)_ram_erase_sector_start) / sizeof(unsigned int); That is obviously undefined as taking the difference between two pointers which are not in the same array is undefined code. Even the comparision: p_rom p_end; is undefined. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
[Bug middle-end/27016] ARM optimizer produces severely suboptimal code
--- Comment #2 from Eric dot Doenges at betty-tv dot com 2006-04-04 09:13 --- (In reply to comment #1) This code is undefined: len = ((unsigned int)_ram_erase_sector_end - (unsigned int)_ram_erase_sector_start) / sizeof(unsigned int); That is obviously undefined as taking the difference between two pointers which are not in the same array is undefined code. Even the comparision: p_rom p_end; is undefined. In the code I took this snippet from, _ram_erase_sector_start and _ram_erase_sector_end are symbols generated by the linker at the start and the end of a special segment which I need to copy to ram, so I would argue that these pointers do in fact refer to the same array (in this case, the array is the entire flash memory). However, none of this should affect the decision to use (or not to use) the post-indexed addressing mode. If I replace the for loop with a for (len = 100; len 0; --len), the quality of the generated code actually degrades even further: ldr r2, .L7 ldr r1, .L7+4 @ lr needed for prologue .L2: ldr r3, [r1, #-4] str r3, [r2, #-4] ldr r3, .L7+8 add r2, r2, #4 cmp r2, r3 add r1, r1, #4 bne .L2 bx lr While I thinks it's nifty that gcc recognizes that it doesn't need to keep the len variable, but instead uses p_ram to determine when the loop is finished, I also think it's pretty brain-dead that it won't use post-indexed addressing for either the ldr or str in the loop. And why it thinks it needs to load the constant end address to compare against every time inside the loop instead of once into a scratch register outside the loop is anyone's guess. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016