> void
> runloop(int *code) {
>   static void *array[] = { &&op_end, &&op1, &&op2 };
>
> start:
>   goto *array[*code];   /* *code indexes the label table */
>
> op1:
>   foo; goto start;      /* foo does the op's work and advances code */
>
> op2:
>   foo; goto start;
>
> op_end:
>   return;
> } // end code
>  w/ code=1000, loop_cnt = 5,000,000
> gcc -O3 speeds.c
> ------
> switcher time delta=195
> caller time delta=268
> jumper time delta=185
> code compile time delta=0
> fast jumper time delta=154
>

I saw another post on the improvement provided by this simple
modification:

goto *array[*code];

lbl1:
  # do stuff
  goto *array[*code];
lbl2:
  # do stuff
  goto *array[*code];
...
end:
  return;

And they were right.  I received MASSIVE performance boosts.

gcc -O3 w/ 1000 ops and 5,000,000 loops
jumper code = 31/32 seconds
fast jumper = 32/31 seconds

gcc -O3 w/ 50,000,000 ops and 10 loops
jumper code = 14 seconds
fast jumper = 13 seconds


This is a FULL 500% faster than the fastest code above.  Note also
that the "fast jumper" is now only on par with the portable version.
(The fast jumper modifies the source code to insert physical label
addresses instead of indexes into the array.)

There are two possible reasons for this.  One is cache-related, but I
highly doubt it, because even the 50Meg (non-cachable) code was
insanely fast (thanks mostly to predictable pre-caching).  The removal
of a branch is the most likely reason.  Previously there were only
about 2 instructions between branches, so the pipeline couldn't be
utilized.  Note that more complex op-codes wouldn't help; they'd
just be more efficient.  What strikes me as odd, though, is that the
jump I eliminated was a constant jump (goto start), and the Pentium Pro
line is supposed to be able to handle those efficiently.

The next thing to test is having 100 core op-codes.  Then it'll be
interesting to see how badly adding the conditionals hurts.
Theoretically we don't need any conditionals: "change_state" op-codes
can simply return NULL just like "end" does, thereby breaking out of
the loop automatically.

I never understood the reasoning behind while(*code) instead of just
while(code) (which is handled properly here).  Obviously there's value
to while( code > start && code < end ), but that belongs in the non-hell-bent loop.

Oh by the way.

500M ops in 15 seconds is about 33 million parrot ops / second.  This isn't
too shabby on a mere 466MHZ machine: roughly 14 cycles per parrot op,
which is not far from the floor for a loop that takes an indirect jump
on every op.  It ran even faster with the non-cached memory (200Meg sys-ram),
than with short cachable bursts but that's mostly because the prefetcher
was always correct.  Note that this was only because there were only 3
assembly instructions per op-code, and all the op-codes were of the same
form (inc, get-next-op, jump).  Pure-mathematical real-parrot-ops should
get pretty close to this, but most parrot-ops will have evil conditionals
that thwart such performance.

-Michael
