A project that I’m working on at the moment calls for a very small footprint. This post is about how to make code really small for the ARM architecture.

As you can probably guess, I’m interested in operating system code, so as an example, I’ve taken a very simple piece of operating system code, and stripped it right back to demonstrate some of the techniques to use when optimising for space. So here is the snippet of code we are going to optimise:

01 struct thread {
02     unsigned int notify_bits;
03 };
04 
05 unsigned int
06 poll(struct thread *thread, unsigned int mask)
07 {
08     unsigned int result;
09     result = thread->notify_bits & mask;
10     if (result) {
11         thread->notify_bits &= ~result;
12     }
13     return result;
14 }

In this very simple operating system we have threads (thread data is stored in struct thread). Each thread has a set of 32 signals (encoded in a single word notify_bits). This poll is used by a thread to determine if it has been sent any signals. The mask parameter is the set of signals that the thread is interested in checking. So, a thread can check if a single signal has been thread, or if any signal has been set, or if a specific subset of signals has been set. The function returns the signals that are available (which is simply the bit-wise and of notify_bits and mask). It is important that the function clears any signals that have been returned. This makes sure that if poll is called twice the same signals are not returned. This is achieved in lines 10—12.

So, our goal is to try and get this code as small as possible. And every byte counts! First off, we just try and compile this with the standard ARM gcc compiler. I’m using version 4.2.2. So we start with: $ arm-elf-gcc -c poll.c -o poll.o. We can then use object dump to work out what the compiler did: $ arm-elf-objdump -dl poll.o.

00000000 <poll&ht;:
poll():
   0:	e1a0c00d 	mov	ip, sp
   4:	e92dd800 	push	{fp, ip, lr, pc}
   8:	e24cb004 	sub	fp, ip, #4	; 0x4
   c:	e24dd00c 	sub	sp, sp, #12	; 0xc
  10:	e50b0014 	str	r0, [fp, #-20]
  14:	e50b1018 	str	r1, [fp, #-24]
  18:	e51b3014 	ldr	r3, [fp, #-20]
  1c:	e5932000 	ldr	r2, [r3]
  20:	e51b3018 	ldr	r3, [fp, #-24]
  24:	e0023003 	and	r3, r2, r3
  28:	e50b3010 	str	r3, [fp, #-16]
  2c:	e51b3010 	ldr	r3, [fp, #-16]
  30:	e3530000 	cmp	r3, #0	; 0x0
  34:	0a000006 	beq	54 <poll+0x54>
  38:	e51b3014 	ldr	r3, [fp, #-20]
  3c:	e5932000 	ldr	r2, [r3]
  40:	e51b3010 	ldr	r3, [fp, #-16]
  44:	e1e03003 	mvn	r3, r3
  48:	e0022003 	and	r2, r2, r3
  4c:	e51b3014 	ldr	r3, [fp, #-20]
  50:	e5832000 	str	r2, [r3]
  54:	e51b3010 	ldr	r3, [fp, #-16]
  58:	e1a00003 	mov	r0, r3
  5c:	e24bd00c 	sub	sp, fp, #12	; 0xc
  60:	e89da800 	ldm	sp, {fp, sp, pc}

So, our first go gets us 100 bytes of code. Which is 10 bytes (or 2.5 instructions) for each of our lines of code. We should be able to do better. Well, the first thing is we should try is to use optimisation: $ arm-elf-gcc -c poll.c -o poll.o -O2. This gives us a much better code generation output:

00000000 <poll>:
poll():
   0:	e5902000 	ldr	r2, [r0]
   4:	e1a0c000 	mov	ip, r0
   8:	e0110002 	ands	r0, r1, r2
   c:	11e03000 	mvnne	r3, r0
  10:	10033002 	andne	r3, r3, r2
  14:	158c3000 	strne	r3, [ip]
  18:	e12fff1e 	bx	lr

So this got us down to 28 bytes. A factor 4 improvement for one compiler flag, not bad. Now, -O2 does some standard omptimisations, but -Os, will do optimisations specifically for reducing the amount of code. So trying: $ arm-elf-gcc -c poll.c -o poll.o -Os. This gives a little bit better code-gen:

00000000 <poll>:
poll():
   0:	e5903000 	ldr	r3, [r0]
   4:	e1a02000 	mov	r2, r0
   8:	e0110003 	ands	r0, r1, r3
   c:	11c33000 	bicne	r3, r3, r0
  10:	15823000 	strne	r3, [r2]
  14:	e12fff1e 	bx	lr

Down to 24 bytes (6 instructions), is pretty good. Now, as you can see the generated code has 32-bits per instruction. The some of the ARM architectures have two distinct instruction sets, ARM and Thumb. The Thumb instruction set uses 16-bit per instruction, instead of 32-bit. This denser instruction set can enable much smaller code sizes. Of course there is a trade-off here. The functionality of the 16-bit instructions is going to be less than the 32-bit instructions. But lets give it a try. At the same time, we will tell the compiler the exact CPU we want to compile for (which is the ARM7TDMI-S) in our case. The compiler line is: $ arm-elf-gcc -c poll.c -o poll.o -Os -mcpu=arm7tdmi -mthumb. Which produces code like:

00000000 <poll>:
poll():
   0:	6803      	ldr	r3, [r0, #0]
   2:	1c02      	adds	r2, r0, #0
   4:	1c08      	adds	r0, r1, #0
   6:	4018      	ands	r0, r3
   8:	d001      	beq.n	e <poll+0xe>
   a:	4383      	bics	r3, r0
   c:	6013      	str	r3, [r2, #0]
   e:	4770      	bx	lr

So, now we are down to 16 bytes, so in Thumb we need 8 instructions (2 more than ARM), but each is only 2 bytes, not 4, so we end up with a 1/3 improvement. To get any further, we need to start looking at our code again, and see if there are ways of improving the code. Looking at the code again:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int result;
04     result = thread->notify_bits & mask;
05     if (result) {
06         thread->notify_bits &= ~result;
07     }
08     return result;
09 }

You may notice that the branch instruction on line 5, you may notice that this is actually redundant. If result is zero, then ~result well be 0xffffffff. Given this thread->notify_bits &= 0xffffffff will not change the value of thread->notify_bits. So, we can reduce this to:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int result;
04     result = thread->notify_bits & mask;
05     thread->notify_bits &= ~result;
06     return result;
07 }

When we compile this we get down to:

00000000 <poll>:
poll():
   0:	6803      	ldr	r3, [r0, #0]
   2:	4019      	ands	r1, r3
   4:	438b      	bics	r3, r1
   6:	6003      	str	r3, [r0, #0]
   8:	1c08      	adds	r0, r1, #0
   a:	4770      	bx	lr

This gets us down to 6 instructions (12 bytes). Pretty good since we started at 100 bytes. Now lets look at the object code in a little bit more detail. If you look at address 0x8, the instruction simply moves register r1 into register r0 so that it in the right place for return. (Note: The ARM ABI has the return value stored in r0). This seems like a bit of a waste, it would be good if there was a way we could have the value stored in r0 and not waste an instruction just moving values between registers. Now, to get a better understanding of what the instructions are doing, I’m going to slightly rewrite the code, and then compile with debugging, so we can see how the generated code matches up with the source code. So first, lets rewrite the code a little bit:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int tmp_notify_bits = thread->notify_bits;
04     mask &= tmp_notify_bits;
05     tmp_notify_bits &= ~mask;
06     thread->notify_bits = tmp_notify_bits;
07     return mask;
08 }

You should convince yourself that this code is equivalent to the existing code. (The code generated is identical). So, we can now line up the source code with the generated code. On line 03 (unsigned int tmp_notify_bits = thread->notify_bits), this matches up with address 0x0 (ldr r3, [r0, #0]). So, register r3 is used to store variable tmp_notify_bits. The parameter thread is stored in register r0. Now, line 04 (mask &= tmp_notify_bits) matches directly with address 0x2 (ands r1, r3). Register r1 matches directly with the mask parameter. The important part of restructuring the code as we have done, is that it becomes obvious that we can directly using the mask parameter instead of needing an extra variable like in the previous code. As we continue line 05 (tmp_notify_bits &= ~mask), matches directly to 0x4 (bics r3, r1). The bics instruction is quite neat in that it can essentially do the &= ~ in one 16-bit instruction. Line 06 (thread->notify_bits = tmp_notify_bits) stores the result back to memory, matches directly to 0x6. Now the final line of code (return mask;), needs two instructions 0x8 and 0xa (adds r0, r1, #0 and bx lr). Now the reason we need to instructions is because mask is stored in register r1, and the return value needs to be in register r0. So how can we get mask to be stored in r0 all along? Well, if we switch the parameters around poll(unsigned int mask, struct thread *thread), the mask will instead be stored in r0 instead of r1. (Note: We don’t necessarily have to change the interface of the function. If we want to support keep the interface we can use a simple macro to textually swap the parameters.) If we compile this, we get the following generated code:

00000000 <poll>:
poll():
   0:	680b      	ldr	r3, [r1, #0]
   2:	4018      	ands	r0, r3
   4:	4383      	bics	r3, r0
   6:	600b      	str	r3, [r1, #0]
   8:	4770      	bx	lr

So, we have got this down to 5 instructions, 10 bytes. This is a factor 10 improvement, not bad!

So a bit of a review:

Compile to Thumb instructions.
Use -Os to optimise for space.
Look for unnecessary code.
Look for examples of registers used in the wrong place.








 

-- 

To unsubscribe, reply using "remove me" as the subject.

[linuxkernelnewbies] Small code

Reply via email to