|
http://benno.id.au/blog/2009/01/01/space_conserving_code A project that I’m working on at the moment calls for a very small footprint. This post is about how to make code really small for the ARM architecture. As you can probably guess, I’m interested in operating system code, so as an example, I’ve taken a very simple piece of operating system code, and stripped it right back to demonstrate some of the techniques to use when optimising for space. So here is the snippet of code we are going to optimise: 01 struct thread {
02 unsigned int notify_bits;
03 };
04
05 unsigned int
06 poll(struct thread *thread, unsigned int mask)
07 {
08 unsigned int result;
09 result = thread->notify_bits & mask;
10 if (result) {
11 thread->notify_bits &= ~result;
12 }
13 return result;
14 }
In this very simple operating system we have threads (thread data
is stored in So, our goal is to try and get this code as small as possible. And every
byte counts! First off, we just try and compile
this with the standard ARM gcc compiler. I’m using version 4.2.2.
So we start with: 00000000 <poll&ht;:
poll():
0: e1a0c00d mov ip, sp
4: e92dd800 push {fp, ip, lr, pc}
8: e24cb004 sub fp, ip, #4 ; 0x4
c: e24dd00c sub sp, sp, #12 ; 0xc
10: e50b0014 str r0, [fp, #-20]
14: e50b1018 str r1, [fp, #-24]
18: e51b3014 ldr r3, [fp, #-20]
1c: e5932000 ldr r2, [r3]
20: e51b3018 ldr r3, [fp, #-24]
24: e0023003 and r3, r2, r3
28: e50b3010 str r3, [fp, #-16]
2c: e51b3010 ldr r3, [fp, #-16]
30: e3530000 cmp r3, #0 ; 0x0
34: 0a000006 beq 54 <poll+0x54>
38: e51b3014 ldr r3, [fp, #-20]
3c: e5932000 ldr r2, [r3]
40: e51b3010 ldr r3, [fp, #-16]
44: e1e03003 mvn r3, r3
48: e0022003 and r2, r2, r3
4c: e51b3014 ldr r3, [fp, #-20]
50: e5832000 str r2, [r3]
54: e51b3010 ldr r3, [fp, #-16]
58: e1a00003 mov r0, r3
5c: e24bd00c sub sp, fp, #12 ; 0xc
60: e89da800 ldm sp, {fp, sp, pc}
So, our first go gets us 100 bytes of code. Which is 10 bytes
(or 2.5 instructions) for each of our lines of code. We should be able
to do better. Well, the first thing is we should try is to use
optimisation: 00000000 <poll>: poll(): 0: e5902000 ldr r2, [r0] 4: e1a0c000 mov ip, r0 8: e0110002 ands r0, r1, r2 c: 11e03000 mvnne r3, r0 10: 10033002 andne r3, r3, r2 14: 158c3000 strne r3, [ip] 18: e12fff1e bx lr So this got us down to 28 bytes. A factor 4 improvement for one
compiler flag, not bad. Now, 00000000 <poll>: poll(): 0: e5903000 ldr r3, [r0] 4: e1a02000 mov r2, r0 8: e0110003 ands r0, r1, r3 c: 11c33000 bicne r3, r3, r0 10: 15823000 strne r3, [r2] 14: e12fff1e bx lr
Down to 24 bytes (6 instructions), is pretty good. Now, as you can
see the generated code has 32-bits per instruction. The some of the
ARM architectures have two distinct instruction sets,
ARM and Thumb. The Thumb instruction
set uses 16-bit per instruction, instead of 32-bit. This denser
instruction set can enable much smaller code sizes. Of course there is
a trade-off here. The functionality of the 16-bit instructions is
going to be less than the 32-bit instructions. But lets give it a
try. At the same time, we will tell the compiler the exact CPU we want
to compile for (which is the ARM7TDMI-S) in our case. The compiler
line is: 00000000 <poll>: poll(): 0: 6803 ldr r3, [r0, #0] 2: 1c02 adds r2, r0, #0 4: 1c08 adds r0, r1, #0 6: 4018 ands r0, r3 8: d001 beq.n e <poll+0xe> a: 4383 bics r3, r0 c: 6013 str r3, [r2, #0] e: 4770 bx lr So, now we are down to 16 bytes, so in Thumb we need 8 instructions (2 more than ARM), but each is only 2 bytes, not 4, so we end up with a 1/3 improvement. To get any further, we need to start looking at our code again, and see if there are ways of improving the code. Looking at the code again: 00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03 unsigned int result;
04 result = thread->notify_bits & mask;
05 if (result) {
06 thread->notify_bits &= ~result;
07 }
08 return result;
09 }
You may notice that the branch instruction on line 5, you may
notice that this is actually redundant. If 00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03 unsigned int result;
04 result = thread->notify_bits & mask;
05 thread->notify_bits &= ~result;
06 return result;
07 }
When we compile this we get down to: 00000000 <poll>: poll(): 0: 6803 ldr r3, [r0, #0] 2: 4019 ands r1, r3 4: 438b bics r3, r1 6: 6003 str r3, [r0, #0] 8: 1c08 adds r0, r1, #0 a: 4770 bx lr This gets us down to 6 instructions (12 bytes). Pretty good since
we started at 100 bytes. Now lets look at the object code in a little
bit more detail. If you look at address 0x8, the instruction simply
moves register 00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03 unsigned int tmp_notify_bits = thread->notify_bits;
04 mask &= tmp_notify_bits;
05 tmp_notify_bits &= ~mask;
06 thread->notify_bits = tmp_notify_bits;
07 return mask;
08 }
You should convince yourself that this code is equivalent to the
existing code. (The code generated is identical). So, we can now line
up the source code with the generated code. On line 03 ( 00000000 <poll>: poll(): 0: 680b ldr r3, [r1, #0] 2: 4018 ands r0, r3 4: 4383 bics r3, r0 6: 600b str r3, [r1, #0] 8: 4770 bx lr So, we have got this down to 5 instructions, 10 bytes. This is a factor 10 improvement, not bad! So a bit of a review:
|
