> void
> runloop(int *code) {
>     static void* array[] = { &&op_end, &&op1, &&op2 };
>
> start:
>     goto *array[*code];
>
> op1:
>     foo; goto start;
>
> op2:
>     foo; goto start;
>
> op_end:
>     return;
> } // end code
>
> w/ code=1000, loop_cnt = 5,000,000
> gcc -O3 speeds.c
> ------
> switcher time delta=195
> caller time delta=268
> jumper time delta=185
> code compile time delta=0
> fast jumper time delta=154
I saw another post on the improvements provided by this simple modification:

        goto *array[*code];
    lbl1:
        # do stuff
        goto *array[*code];
    lbl2:
        # do stuff
        goto *array[*code];
    ...
    end:
        return;

And they were right. I received MASSIVE performance boosts.

gcc -O3 w/ 1000 ops and 5,000,000 loops:
    jumper code = 31/32 seconds
    fast jumper = 32/31 seconds

gcc -O3 w/ 50,000,000 ops and 10 loops:
    jumper code = 14 seconds
    fast jumper = 13 seconds

This is a full 500% faster than the fastest code above. Note also that the "fast jumper" is now only on par with the portable version. (The fast jumper modifies the source code to insert physical label addresses instead of indexes into the array.)

There are two possible explanations. One is cache-related, but I highly doubt it, because even the 50M-op (non-cacheable) code was insanely fast (thanks mostly to predictable prefetching). The removal of a branch is the more likely reason: previously there were only about two instructions between branches, so the pipeline couldn't be kept full. Note that more complex op bodies wouldn't change this; they'd just amortize the dispatch cost better.

What strikes me as odd, though, is that the branch removed was a constant jump (goto start), which the Pentium Pro line is supposed to handle efficiently.

The next thing to test is having 100 core op-codes. Then it'll be interesting to see how badly the added conditionals hurt. Theoretically we don't need any conditionals: "change_state" op-codes can simply return NULL just like "end" does, thereby breaking out of the loop automatically. I never understood the reasoning behind while(*code) instead of just while(code) (which is handled properly here). Obviously there's value in while (code > start && code < end), but that belongs in the non-hell-bent loop.

Oh, by the way: 500M ops in 15 seconds is about 33 million parrot ops per second, or roughly 14 cycles per op. Not too shabby for a mere 466MHz machine.
It ran even faster with the non-cached memory (200M of system RAM) than with short cacheable bursts, but that's mostly because the prefetcher was always correct. Note that this held only because there were just three assembly instructions per op-code, and all the op-codes had the same form (increment, fetch next op, jump). Pure-mathematical real parrot ops should get pretty close to this, but most parrot ops will have evil conditionals that thwart such performance.

-Michael