> void
> runloop(int *code) {
>     static void *array[] = { &&op_end, &&op1, &&op2 };
>
> start:
>     goto *array[*code++];  /* fetch op, advance, dispatch */
>
> op1:
>     foo; goto start;
>
> op2:
>     foo; goto start;
>
> op_end:
>     return;
> } // end code
> w/ code=1000, loop_cnt = 5,000,000
> gcc -O3 speeds.c
> ------
> switcher time delta=195
> caller time delta=268
> jumper time delta=185
> code compile time delta=0
> fast jumper time delta=154
>
I saw another post on the improvements provided by a simple
modification: replicate the dispatch at the end of every op instead of
jumping back to a single shared dispatch point:

    goto *array[*code];
lbl1:
    # do stuff
    goto *array[*code];
lbl2:
    # do stuff
    goto *array[*code];
...
end:
    return;
And they were right: I received MASSIVE performance boosts.

gcc -O3 w/ 1000 ops and 5,000,000 loops
    jumper code = 31/32 seconds
    fast jumper = 32/31 seconds

gcc -O3 w/ 50,000,000 ops and 10 loops
    jumper code = 14 seconds
    fast jumper = 13 seconds

This is a FULL 500% faster than the fastest code above. Note also
that the "fast jumper" is now only on par with the portable version.
(The fast jumper modifies the source code to insert physical label
addresses directly instead of using indexes into the array.)
There are two possible reasons for this. One is cache-related, but I
highly doubt it, because even the 50M-op (non-cachable) code was
insanely fast (thanks mostly to predictable prefetching). The removal
of a branch is the more likely reason: previously there were only
about 2 instructions between branches, so the pipeline couldn't be
utilized. Note that more complex op-codes wouldn't help here; they'd
just be more efficient per dispatch. What strikes me as odd, though,
is that the jump I removed was a constant one (goto start), and the
Pentium Pro line is supposed to be able to handle those efficiently.
The next thing to test is having 100 core op-codes; then it'll be
interesting to see how badly adding the conditionals hurts.
Theoretically we don't need any conditionals: "change_state" op-codes
can simply return NULL, just like "end" does, thereby breaking out of
the loop automatically.
I never understood the reasoning behind while(*code) instead of just
while(code) (which is handled properly here). Obviously there's value
in while( code > start && code < end ), but that belongs in the
non-hell-bent loop.
Oh, by the way: the cached run works out to 1000 ops * 5,000,000
loops = 5 billion parrot ops in ~31 seconds, or roughly 160 million
parrot ops / second. That isn't too shabby on a mere 466MHz machine:
a scant ~2.9 cycles per parrot op, getting pretty close to the
theoretical max that architecture can perform. Note that this was
only because there were just 3 assembly instructions per op-code, and
all the op-codes were of the same form (inc, get-next-op, jump). Even
the non-cached run (50M ops of code filling 200Meg of sys-ram) held
up remarkably well, mostly because the prefetcher was always correct.
Pure-mathematical real-parrot-ops should get pretty close to this,
but most parrot-ops will have evil conditionals that thwart such
performance.
-Michael