Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Tuesday 09 October 2001 02:11 pm, Michael Maraist wrote: > I saw another post on the improvements provided by the simple > modification of: > > goto *array[*code]; > > lbl1: > # do stuff > goto *array[*code]; > lbl2: > # do stuff > goto *array[*code]; > ... > end: > return; > > And they were right. I received MASSIVE performance boosts. Yeah, this is what I coded up into the current Parrot baseline. I went from 46M to 113M ops/sec. (On test.pasm. This is my Linux/Athlon system. The 46M is normalized to a no-input-checking dispatch loop.) > The next thing to test is having 100 core op-codes. Then it'll be > interesting to see how horrible adding the conditionals are. > Theoretically we don't need any conditionals. "change_state" op-codes > can simply return NULL just like end does, thereby breaking out of the > loop automatically. Except how do you determine at compile time when to insert change state opcodes? I'm convinced (and humbled :-) by this approach (even though it makes gcc scream under Wap rules), but I'm not convinced that we'll have an all-inclusive static set of opcode functions. (Which means that you'll still have to check for a valid target.) > > I never understood the reasoning behind while(*code) instead of just > while(code) (which is properly handled here). Obviously there's value > to while( code > start && code < end ), but that's in the non-hell-bent > loop. As I previously posted, that was probably my typo (left over from caching the value on x86) and should have been pc. -- Bryan C. Warnock [EMAIL PROTECTED]
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
At 03:59 PM 10/9/2001 -0400, Michael Fischer wrote: >Declaring that static void * array[] = {...} in a header, >produced by process_opfunc.pl. Dan, Michael, Bryan, >care for me to revamp my patch to print out a goto >table like the above? Sure, if it works and doesn't get in the way of doing it differently. >Dan, so out of all this, will vtable funcs have to be >represented in basic_opcodes.ops, or are they above all >that, and conversions (if any) done elsewhere? Vtable code goes elsewhere. >Or, more >cogently (*cough*), is the method of writing more stuff >in basic_opcodes.ops and letting the perl scripts handle >the conversion to C code still valid regardless of what >we do with vtables, modules, etc.? Yup. While the vtable code will be processed differently, it'll still be processed. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Oct 09, Michael Maraist <[EMAIL PROTECTED]> took up a keyboard and banged out > > void > > runloop(int**code) { > > static void* array[] = { &&op_end, &&op1, &&op2 }; > > > > start: > > goto *array[*code]; > > > > op1: > > foo; goto start; > > > > op2: > > foo; goto start; > > > > op_end: > > return; > > } // end code Declaring that static void * array[] = {...} in a header, produced by process_opfunc.pl. Dan, Michael, Bryan, care for me to revamp my patch to print out a goto table like the above? Dan, so out of all this, will vtable funcs have to be represented in basic_opcodes.ops, or are they above all that, and conversions (if any) done elsewhere? Or, more cogently (*cough*), is the method of writing more stuff in basic_opcodes.ops and letting the perl scripts handle the conversion to C code still valid regardless of what we do with vtables, modules, etc.? Or will that stuff call for a change to the design of those perl scripts, and the .ops file? Michael Fischer -- Developer, "Beware of bugs in the above code. [EMAIL PROTECTED] I have only proven it correct, not tested it." [EMAIL PROTECTED]-- Donald Knuth
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Oct 09, Michael Maraist <[EMAIL PROTECTED]> took up a keyboard and banged out > > void > > runloop(int**code) { > > static void* array[] = { &&op_end, &&op1, &&op2 }; > > > > start: > > goto *array[*code]; > > > > op1: > > foo; goto start; > > > > op2: > > foo; goto start; > > > > op_end: > > return; > > } // end code Declaring that static void * array[] = {...} in a header, produced by process_opfunc.pl. Dan, Michael, Bryan, care for me to revamp my patch to print out a goto table like the above? Dan, so out of all this, will vtable funcs have to be represented in basic_opcodes.ops, or are they above all that, and conversions (if any) done elsewhere? Or, more cogently (*cough*), is the method of writing more stuff in basic_opcodes.ops and letting the perl scripts handle the conversion to C code still valid regardless of what we do with vtables, modules, etc.? Or will that stuff call for a change to the design of those perl scripts, and the .ops file? Michael Fischer -- Developer, "Beware of bugs in the above code. [EMAIL PROTECTED] I have only proven it correct, not tested it." [EMAIL PROTECTED]-- Donald Knuth
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
At 12:54 PM 10/9/2001 -0400, Bryan C. Warnock wrote: >We are going to need to come up with a better strategy than a DO_OP macro, >though. DO_OP can certainly be a placeholder we rip out and replace as part of the configuration process. We're not limited to C's macro system if we don't want it. Heck, the whole interpreter loop can be reduced to a RUN_SOME_CODE macro that's ripped out and replaced if we want. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
At 02:11 PM 10/9/2001 -0400, Michael Maraist wrote: >Oh by the way. > >500M ops in 15 seconds is 1.02 Billion parrot ops / second. This isn't >too shabby on a mere 466MHZ machine. The performance is impressive regardless, but isn't that 33M parrot ops/sec rather than 1.02G ops/sec? >Pure-mathematical real-parrot-ops should >get pretty close to this, but most parrot-ops will have evil conditionals >that thwart such performance. Stamping out evil conditionals is one of the reasons for the variable vtables. We can't get rid of *all* the conditionals (it'd be a dull program that didn't make some decision... :) but if we're going to blow pipeline we might as well do it once and real big. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
> void > runloop(int**code) { > static void* array[] = { &&op_end, &&op1, &&op2 }; > > start: > goto *array[*code]; > > op1: > foo; goto start; > > op2: > foo; goto start; > > op_end: > return; > } // end code > w/ code=1000, loop_cnt = 5,000,000 > gcc -O3 speeds.c > -- > switcher time delta=195 > caller time delta=268 > jumper time delta=185 > code compile time delta=0 > fast jumper time delta=154 > I saw another post on the improvements provided by the simple modification of: goto *array[*code]; lbl1: # do stuff goto *array[*code]; lbl2: # do stuff goto *array[*code]; ... end: return; And they were right. I received MASSIVE performance boosts. gcc -O3 w/ 1000 ops and 5,000,000 loops jumper code = 31/32 seconds fast jumper = 32/31 seconds gcc -O3 w/ 50,000,000 ops and 10 loops jumper code = 14 seconds fast jumper = 13 seconds This is a FULL 500% faster than the fastest above code. Note also that the "fast jumper" is only on par with the portable version. (The fast jumper modified the source-code to insert physical addresses instead of using indexes to the array). Two possible reasons for this. One is cache-related issues, but I highly doubt it because even the 50Meg (non-cachable) code was insanely fast (thanks mostly due to predictable pre-caching). The reduction of a branch is most likely the reason. Previously the instruction until a branch was like 2 so the pipeline couldn't be utilized. Note that more complex instructions wouldn't help; they'd just be more efficient. What strikes me as odd though is that I used a constant jump (goto start). The Pentium Pro line is supposed to be able to handle this efficiently. The next thing to test is having 100 core op-codes. Then it'll be interesting to see how horrible adding the conditionals are. Theoretically we don't need any conditionals. "change_state" op-codes can simply return NULL just like end does, thereby breaking out of the loop automatically. I never understood the reasoning behind while(*code) instead of just while(code) (which is properly handled here). Obviously there's value to while( code > start && code < end ), but that's in the non-hell-bent loop. Oh by the way. 500M ops in 15 seconds is 1.02 Billion parrot ops / second. This isn't too shabby on a mere 466MHZ machine. That's a scant 2.18 parrot ops per cycle. Getting pretty close to the theoretical max that architecture can perform. It ran even faster with the non-cached memory (200Meg sys-ram), than with short cachable bursts but that's mostly because the prefetcher was always correct. Note that this was only because there were only 3 assembly instructions per op-code, and all the op-codes were of the same form (inc, get-next-op, jump). Pure-mathematical real-parrot-ops should get pretty close to this, but most parrot-ops will have evil conditionals that thwart such performance. -Michael
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Tuesday 09 October 2001 12:05 pm, Paolo Molaro wrote: > > It improved the speed of non-optimized code, because you didn't jump to > > the end of the switch simply to jump back to the loop conditional. But > > I didn't see any additional improvements with optimized code, because > > the optimizers take care of that for you. (Well, really, they put > > another copy of the while condition at the bottom.) > > Yes, that was basically the same mistake I did at first when testing > the goto label stuff, little or no improvement. But the 'right' way to do > it is to use the goto label construct not only instead of switch(), but > also instead of 'break;'. Oh, and the address array needs to be > declared const and static, too. > > const static void * goto_map[] = {&&OP1_LABEL, &&OP2_LABEL, ...}; > > ... > while (1) { > goto *goto_map [*ip]; // switch (*ip) > OP1_LABEL: > do_stuff (); > update_ip (); > goto *goto_map [*ip]; // break; > OP2_LABEL: > do_other_stuff (); > update_ip (); > goto *goto_map [*ip]; // break; > ... > } Ooh, I like that. I'll have to experiment. > > I've got three different opcode loops so far for Parrot. > > (Linux(x86)/gcc, Solaris(SPARC)/Forte, and Solaris(SPARC)/gcc). I've > > tried most every combination I can think of (am still working off the > > last couple, as a matter of fact). (Particularly ever since I received > > the inexplicable slowdown adding a default case.) Take nothing for > > granted, and try it all. I've posted some of my more ridiculous > > failures, and have posted what I have found to be my best numbers. > > Anyone is free to come up with a better, faster solution that meets the > > requirements. > > Thanks for your efforts, hard numbers are always better than talk! :-) > I was just trying to offer the experience I gathered working on similar > issues, hoping it can be useful. And yes, I was suggesting to change > a bit the requirements (not of the language, but of the current VM design) > more than proposing an implementation of it. That would be up to Dan, then. I may pitch small tweaks to the design (like reserving a set of opcodes strictly for as core internal), but I'm not confident enough to do more than that. (It's like Grandpa's Five Iron. (You replace the head, the shaft, and the grip, and still call it Grandpa's Five Iron.)) We are going to need to come up with a better strategy than a DO_OP macro, though. -- Bryan C. Warnock [EMAIL PROTECTED]
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
At 12:13 PM 10/9/2001 -0400, Michael Maraist wrote: >The key was &&lbl which returns the physical address of the a jumpable >label. This is highly gcc-specific, but given that we can use >#ifdef's I don't see a problem with it. :) No #ifdefs here. Instead the contents of the DO_OP macro should be dynamically generated at configure time. (Which is OK, since we need to eat all the .ops files at compile time anyway) The results look really interesting, though. I'm tempted to see how it works with potentially 16-bit opcodes. (Which is where noops would come in handy as padding to guarantee the 32-bit inlined integers were properly aligned) Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Tue, 9 Oct 2001, Dan Sugalski wrote: > At 01:20 PM 10/9/2001 +0200, Paolo Molaro wrote: > >On 10/07/01 Bryan C. Warnock wrote: > > > while (*pc) { > > > switch (*pc) { > > > } > > > } > If anyone wants an official ruling... > > DO_OP can contain any code you like as long as it: > > *) Some version ultimately compiles everywhere > *) Allows opcodes above a certain point (probably 256 or 512) to be > overridden at runtime > > Computed gotos, switches, function table dispatches, generated machine > code, or some other form of coding madness are all OK. I don't care which > way a particular platform goes as long as it doesn't constrain another > platform's ability to go another way. Hehe. Ok, I delved into the particulars of gcc and found: void runloop(int**code) { static void* array[] = { &&op_end, &&op1, &&op2 }; start: goto *array[*code]; op1: foo; goto start; op2: foo; goto start; op_end: return; } // end code The key was &&lbl which returns the physical address of the a jumpable label. This is highly gcc-specific, but given that we can use #ifdef's I don't see a problem with it. :) I even wrote an ugly version that avoided that array indirection. What it did was "convert" the code-base to change the op-codes into physical addresses. This obviously has some negative ramifications, but is definately doable. By removing that extra indirection I got a couple percent extra improvement (12%). I can't imagine you'll get much faster than that (except for an inlining jit or perl6->C compiler). The only problem was that the &&lbl's didn't want to be non-local, so I had to perform some evil magic. The unpolished code is attached with a cheesy instruction set which was just designed to prevent the optimizer from optimizing away my op-codes. Here are the benchmarks on a 466MHZ dual proc celeron: w/ code=500,000,000 and loop_cnt = 10 (cache-bound) gcc speeds.c - switcher time delta=45 caller time delta=47 jumper time delta=43 code compile time delta=4 fast jumper time delta=34 gcc -O speeds.c - switcher time delta=30 caller time delta=41 jumper time delta=28 code compile time delta=3 fast jumper time delta=26 gcc -O2 speeds.c --- switcher time delta=30 caller time delta=41 jumper time delta=28 code compile time delta=3 fast jumper time delta=25 gcc -O3 speeds.c switcher time delta=29 caller time delta=41 jumper time delta=28 code compile time delta=2 fast jumper time delta=26 == == w/ code=1000, loop_cnt = 5,000,000 gcc -O3 speeds.c -- switcher time delta=195 caller time delta=268 jumper time delta=185 code compile time delta=0 fast jumper time delta=154 # Note a marginal improvement from switch -> jump, yet an even bigger improvement in jump to fast-jump. It's unlikely to have a 50M (x4B+) code-base. But it's possible to have an inner loop that has 500,000 iterations. -Michael #include #include #include #include #include #include #include #include //#define SIZE 5000 #define SIZE 1000 int codep[SIZE]; void dump_mappings() { char strbuf[1024]; int pid = getpid(); sprintf( strbuf, "cat /proc/%i/maps", pid ); system( strbuf ); sprintf( strbuf, "ls -l /proc/%i/fd" ); system( strbuf ); printf("press enter to continue"); scanf("%c",strbuf); } int x,y,z; void switcher( int*code) { while(1) switch(*code) { case 0: goto DONE; break; case 1: x++; code += 1; break; case 2: y++; code += 1; break; case 3: z++; code += 1; break; case 4: x = y + z; code += 1; break; } DONE: } // end switcher int* end(int*code) { return 0; } int* o1(int*code) { x++; return code+1; } int* o2(int*code) { y++; return code+1; } int* o3(int*code) { z++; return code+1; } int* o4(int*code) { x = y + z; return code+1; } typedef int* (*op_handler_t)(int*code); op_handler_t op_codes[] = { end, o1, o2, o3, o4 }; void caller( int*code) { while(*code) code = op_codes[ *code ](code); } // end caller void gotoer( int*code) { static int* array[] = { &&lend, &&lo1, &&lo2, &&lo3, &&lo4 }; start: /* if ( !code ) { puts("null code"); exit(-1); } if ( *code > 3 ) { puts("Invalid idx"); exit(-1); } */ goto *array[ *code ]; lo1: x++; code += 1; goto start; lo2: y++; code += 1; goto start; lo3: z++; code += 1; goto start; lo4: z++; code += 1; goto start; lend: return; } // end gotoer int* convert( int size, int*code, int*map ) { int* retval, *base; int idx =0; //dump_mappings(); retval = base = (int*)malloc(size * sizeof(int)); if ( !retval ) { puts("out of memory"); exit(-1); } //dump_mappings(); //printf("retval=%p, codep=%p\n", retval, codep ); /* if (!retval) { puts("couldn't allo
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On 10/09/01 Bryan C. Warnock wrote: > > About the goto label feature discussed elsewhere, if the dispatch > > loop is emitted at compile time, there is no compatibility problem > > with non-gcc compilers, since we know what compiler we are going to use. > > I got speedups in the 10-20% range with dispatch-intensive benchmarks in > > mono. It can also be coded in a way similar to the switch code > > with a couple of defines, if needed, so that the same code compiles > > on both gcc and strict ANSI compilers. > > I also tested (previously; I need to hit it again) replacing the loop, > switch, and breaks with a lot of gotos and labels. > > LOOP: > /* function look up, if need be */ > switch (*pc) { > case (1) : { /* yada yada yada */; goto LOOP } > ... > } > > It improved the speed of non-optimized code, because you didn't jump to the > end of the switch simply to jump back to the loop conditional. But I didn't > see any additional improvements with optimized code, because the optimizers > take care of that for you. (Well, really, they put another copy of the > while condition at the bottom.) Yes, that was basically the same mistake I did at first when testing the goto label stuff, little or no improvement. But the 'right' way to do it is to use the goto label construct not only instead of switch(), but also instead of 'break;'. Oh, and the address array needs to be declared const and static, too. const static void * goto_map[] = {&&OP1_LABEL, &&OP2_LABEL, ...}; ... while (1) { goto *goto_map [*ip]; // switch (*ip) OP1_LABEL: do_stuff (); update_ip (); goto *goto_map [*ip]; // break; OP2_LABEL: do_other_stuff (); update_ip (); goto *goto_map [*ip]; // break; ... } > > The problem here is to make sure we really need the opcode swap > > functionality, it's really something that is going to kill > > dispatch performance. > > If a module wants to change the meaning of, eg the + operator, > > it can simply request the compiler to insert a call to a > > subroutine, instead of changing the meaning assigned to the > > VM opcode. The compiler is free to inline the sub, of course, > > just don't cripple the normal case with unnecessary overhead > > and let the special case pay the price of flexibility. > > Of course, if the special case is not so special, a _new_ > > opcode can be introduced, but there is really no reason to > > change the meaning of an opcode on the fly, IMHO. > > Comment, or flame, away. > > But how are you going to introduce the new opcode? Recompile Perl? Nope, this is done at the design and early beta stage: we decide which opcodes make sense and which not and we may add specialized opcodes if it makes sense to do so. Once the opcode set is defined, it can't be changed until the next release. > Unacceptable. We understand that from a classic language perspective, we're > slow and lumbering. We're Perl. We need that flexibilty. We're trying to > make that flexibility as fast as possible. Completely agree. My point is that I don't see a real reason to be flexible at the opcode level (changing the meaning of the opcodes) when you can have the same outcome with a cleaner design (set a different method in a class' vtable). Many things were done as opcodes in perl5 because calling subroutines is slow there, IIRC. Many of that features can simply be methods in some class in perl6: it's cleaner and it's flexible and it doesn't require to change the meaning of the opcodes. > I've got three different opcode loops so far for Parrot. (Linux(x86)/gcc, > Solaris(SPARC)/Forte, and Solaris(SPARC)/gcc). I've tried most every > combination I can think of (am still working off the last couple, as a > matter of fact). (Particularly ever since I received the inexplicable > slowdown adding a default case.) Take nothing for granted, and try it all. > I've posted some of my more ridiculous failures, and have posted what I have > found to be my best numbers. Anyone is free to come up with a better, > faster solution that meets the requirements. Thanks for your efforts, hard numbers are always better than talk! :-) I was just trying to offer the experience I gathered working on similar issues, hoping it can be useful. And yes, I was suggesting to change a bit the requirements (not of the language, but of the current VM design) more than proposing an implementation of it. lupus -- - [EMAIL PROTECTED] debian/rules [EMAIL PROTECTED] Monkeys do it better
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
At 10:51 PM 10/7/2001 -0400, Michael Fischer wrote: >Ok, this is a big one. Over 500 line diff, plus a new module >for parrot/Parrot. Thanks a lot for this. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
At 01:20 PM 10/9/2001 +0200, Paolo Molaro wrote: >On 10/07/01 Bryan C. Warnock wrote: > > while (*pc) { > > switch (*pc) { > > } > > } > >With the early mono interpreter I observed a 10% slowdown when I >checked that the instruction pointer was within the method's code: >that is not a check you need on every opcode dispatch, but only with >branches, so it makes sense to have a simple while (1) loop. Yeah, I really want to toss that as well for the normal case. Checking the validity of the instruction pointer's a good thing for the safe(ish) version of the interpreter loop, where we're explicitly asking for assurances. >About the goto label feature discussed elsewhere, if the dispatch >loop is emitted at compile time, there is no compatibility problem >with non-gcc compilers, since we know what compiler we are going to use. >I got speedups in the 10-20% range with dispatch-intensive benchmarks in >mono. It can also be coded in a way similar to the switch code >with a couple of defines, if needed, so that the same code compiles >on both gcc and strict ANSI compilers. If anyone wants an official ruling... DO_OP can contain any code you like as long as it: *) Some version ultimately compiles everywhere *) Allows opcodes above a certain point (probably 256 or 512) to be overridden at runtime Computed gotos, switches, function table dispatches, generated machine code, or some other form of coding madness are all OK. I don't care which way a particular platform goes as long as it doesn't constrain another platform's ability to go another way. > > seems simpler, but introduces some potential page (or at least i-cache(?)) > > thrashing, as you've got to do a significant jump just in order to jump > > again. The opcode comparison, followed by a small jump, behaves > much >nicer. FWIW (and I know this is out of order) the distance jumped really doesn't make any difference in the time it takes to jump. The real indicator if the time is what cache the destination is in. A 20 word jump will take longer than a 4M jump if the destination for the first is in main memory and the second in L1 cache. (L1 cache generally being loaded in 8, 16, or 32 byte chunks and L2 cache in 512 or 1024 byte chunks, but that's terribly processor-dependent) > > I've found [2] that the fastest solution (on the platforms I've tested) > are > > within the family: > > > > while (*pc) { > > if (*pc > CORE_OPCODE_NUMBER) { > > pc = func_table[*pc](); > > } else { > > switch (*pc) { > > } > > } >... but adds a comparison even for opcodes that don't need it. But it removes some uncertainty that the C compiler optimizer might be able to use to do better optimizations of the following switch. (Whether it *will* or not depends on the smarts built into the compiler) I can fully see this as a non-intuitive chunk of code that gets a performance boost. I'm really thinking that we're going to end up with benchmarked chunks of code keyed by OS and processor here, chosen by the configure process. I'm half-tempted to weld the benchmark runs into configure itself, but I'm leery of it choosing the wrong thing because some cron job fired off in the middle of its check. >The problem here is to make sure we really need the opcode swap >functionality, it's really something that is going to kill >dispatch performance. Dunno about kill, but it will have an impact. I'm fully aware of that, but lots of things in dynamic languages in general have performance impacts. Sometimes (and whether this is one of those times or not is up in the air) you just have to pay it. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Tuesday 09 October 2001 07:20 am, Paolo Molaro wrote: > On 10/07/01 Bryan C. Warnock wrote: > > while (*pc) { > > switch (*pc) { > > } > > } > Should have been "while (pc)". Oops. > With the early mono interpreter I observed a 10% slowdown when I > checked that the instruction pointer was within the method's code: > that is not a check you need on every opcode dispatch, but only with > branches, so it makes sense to have a simple while (1) loop. Yes, it's amazing how fast "hell-bent" is. ;-) In my pre-Parrot testing, I did simply do a while (1). Since, I've used whatever was already coded. (Mostly.) However, I'd need a ruling from Dan on what is acceptable for the opcode dispatch loop. I've simply used the existing code to provide a baseline behavior. > > About the goto label feature discussed elsewhere, if the dispatch > loop is emitted at compile time, there is no compatibility problem > with non-gcc compilers, since we know what compiler we are going to use. > I got speedups in the 10-20% range with dispatch-intensive benchmarks in > mono. It can also be coded in a way similar to the switch code > with a couple of defines, if needed, so that the same code compiles > on both gcc and strict ANSI compilers. I also tested (previously; I need to hit it again) replacing the loop, switch, and breaks with a lot of gotos and labels. LOOP: /* function look up, if need be */ switch (*pc) { case (1) : { /* yada yada yada */; goto LOOP } ... } It improved the speed of non-optimized code, because you didn't jump to the end of the switch simply to jump back to the loop conditional. But I didn't see any additional improvements with optimized code, because the optimizers take care of that for you. (Well, really, they put another copy of the while condition at the bottom.) > > seems simpler, but introduces some potential page (or at least > > i-cache(?)) thrashing, as you've got to do a significant jump just in > > order to jump again. The opcode comparison, followed by a small jump, > > behaves much nicer. > > ... but adds a comparison even for opcodes that don't need it. > As with the check for the program counter, it's a check that > not all the opcodes need and as such should be left out of the > fast path. This means that the first 200 and something opcodes > are the most common ones _and_ the ones that need to be fast: > there is no point in avoiding a jump for calls to exit(), > read() etc (assuming those need opcodes at all). Well, you've got to represent them somehow. I haven't come up with any clever (or non-clever, for that matter) of doing that. We toyed with earlier of doing contextual switches - of changing the opcode loop itself as the context of the opcode stream changes. Part of that's been done already, with the trace and bounds checking. Currently, though, the opcode loops are too heavy to make that efficient. > > The problem here is to make sure we really need the opcode swap > functionality, it's really something that is going to kill > dispatch performance. > If a module wants to change the meaning of, eg the + operator, > it can simply request the compiler to insert a call to a > subroutine, instead of changing the meaning assigned to the > VM opcode. The compiler is free to inline the sub, of course, > just don't cripple the normal case with unnecessary overhead > and let the special case pay the price of flexibility. > Of course, if the special case is not so special, a _new_ > opcode can be introduced, but there is really no reason to > change the meaning of an opcode on the fly, IMHO. > Comment, or flame, away. But how are you going to introduce the new opcode? Recompile Perl? Unacceptable. We understand that from a classic language perspective, we're slow and lumbering. We're Perl. We need that flexibilty. We're trying to make that flexibility as fast as possible. I've got three different opcode loops so far for Parrot. (Linux(x86)/gcc, Solaris(SPARC)/Forte, and Solaris(SPARC)/gcc). I've tried most every combination I can think of (am still working off the last couple, as a matter of fact). (Particularly ever since I received the inexplicable slowdown adding a default case.) Take nothing for granted, and try it all. I've posted some of my more ridiculous failures, and have posted what I have found to be my best numbers. Anyone is free to come up with a better, faster solution that meets the requirements. -- Bryan C. Warnock [EMAIL PROTECTED]
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On 10/09/01 Benjamin Stuhl wrote: > Unfortunately, compiler tricks only work at compile time. > They're great for static languages like C++ or C#, but Perl > supports doing > > %CORE::GLOBAL::{'&print'} = \&myprint; > > at _runtime_. This is much to late to be going back and > patching up any occurences of print_p in the opstream, so > we need a level of indirection on every overridable opcode. > (Note that the _overloadable_ ones like the math routines > don't need that level of indirection - they get it by > vectoring through the PMC vtables). And the same is supposed to happen when you override print, since print is just a method in the Stream 'class' or whatever it's going to be called. If there is a 'print' opcode the implementation should do something like: case PRINT_OP: interp->current_stdout->vtable->print (...); and that will do the right thing anyway. lupus -- - [EMAIL PROTECTED] debian/rules [EMAIL PROTECTED] Monkeys do it better
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
--- Paolo Molaro <[EMAIL PROTECTED]> wrote: [snip, snip] > The problem here is to make sure we really need the > opcode swap > functionality, it's really something that is going to > kill > dispatch performance. > If a module wants to change the meaning of, eg the + > operator, > it can simply request the compiler to insert a call to a > subroutine, instead of changing the meaning assigned to > the > VM opcode. The compiler is free to inline the sub, of > course, > just don't cripple the normal case with unnecessary > overhead > and let the special case pay the price of flexibility. > Of course, if the special case is not so special, a _new_ > opcode can be introduced, but there is really no reason > to > change the meaning of an opcode on the fly, IMHO. > Comment, or flame, away. Unfortunately, compiler tricks only work at compile time. They're great for static languages like C++ or C#, but Perl supports doing %CORE::GLOBAL::{'&print'} = \&myprint; at _runtime_. This is much to late to be going back and patching up any occurences of print_p in the opstream, so we need a level of indirection on every overridable opcode. (Note that the _overloadable_ ones like the math routines don't need that level of indirection - they get it by vectoring through the PMC vtables). -- BKS __ Do You Yahoo!? NEW from Yahoo! GeoCities - quick and easy web site hosting, just $8.95/month. http://geocities.yahoo.com/ps/info1
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On 10/07/01 Bryan C. Warnock wrote: > while (*pc) { > switch (*pc) { > } > } With the early mono interpreter I observed a 10% slowdown when I checked that the instruction pointer was within the method's code: that is not a check you need on every opcode dispatch, but only with branches, so it makes sense to have a simple while (1) loop. About the goto label feature discussed elsewhere, if the dispatch loop is emitted at compile time, there is no compatibility problem with non-gcc compilers, since we know what compiler we are going to use. I got speedups in the 10-20% range with dispatch-intensive benchmarks in mono. It can also be coded in a way similar to the switch code with a couple of defines, if needed, so that the same code compiles on both gcc and strict ANSI compilers. > I don't see (on simple inspection) a default case, which implies that all > functions would be in the switch. There's two problems with that. First, > you can't then swap out (or add) opcode functions, which compilation units > need to do. They're all fixed, unless you handle opcode differentiation > within each case. (And there some other problems associated with that, too, > but they're secondary.) Second, the switch will undoubtedly grow too large > to be efficient. A full switch with as few as 512 branches already shows > signs of performance degradation. [1] Thirdly, any function calls that you > then *do* have to make, come at the expense of the switching branch, on top > of normal function call overhead. > > I've found [2] that the fastest solution (on the platforms I've tested) are > within the family: > > while (*pc) { > if (*pc > CORE_OPCODE_NUMBER) { > pc = func_table[*pc](); > } else { > switch (*pc) { > } > } > > That keeps the switch branching small. Do this: > > while (*pc) { > switch (*pc) { > case : ... > default: pc = func_table[*pc](); > } > } > > seems simpler, but introduces some potential page (or at least i-cache(?)) > thrashing, as you've got to do a significant jump just in order to jump > again. The opcode comparison, followed by a small jump, behaves much nicer. ... but adds a comparison even for opcodes that don't need it. As with the check for the program counter, it's a check that not all the opcodes need and as such should be left out of the fast path. This means that the first 200 and something opcodes are the most common ones _and_ the ones that need to be fast: there is no point in avoiding a jump for calls to exit(), read() etc (assuming those need opcodes at all). The problem here is to make sure we really need the opcode swap functionality, it's really something that is going to kill dispatch performance. If a module wants to change the meaning of, eg the + operator, it can simply request the compiler to insert a call to a subroutine, instead of changing the meaning assigned to the VM opcode. The compiler is free to inline the sub, of course, just don't cripple the normal case with unnecessary overhead and let the special case pay the price of flexibility. Of course, if the special case is not so special, a _new_ opcode can be introduced, but there is really no reason to change the meaning of an opcode on the fly, IMHO. Comment, or flame, away. lupus -- - [EMAIL PROTECTED] debian/rules [EMAIL PROTECTED] Monkeys do it better
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
Oops, on-list again. On Monday 08 October 2001 12:12 am, Michael Fischer wrote: > > It is generated as a pass over basic_opcodes.ops, just as the function > definitions in basic_opcodes.c are done. I figured the philosophy was > "Don't hack the output, hack the template" if you want to change the > generated file. What swapping of opcodes do you imagine? I wouldn't have > thought that user code gets to muck with the asm offerings available. Every compilation unit (like a module, for instance) carries some or all of its own opcode table with it. These are switched out as the scope of the compilation unit changes. External code can and will define their own opcode functions, much in the same way that basic_opcodes.ops defines the default ones for the Parrot interpreter. However, to remove the necessity of opcode numbering coordination, the opcodes are arbitrarily numbered per compilation unit. So opcode 220 in Foo.pm may be a "leading 0" bit test, while opcode 220 in Bar.pm may be pipe creation. As the interpreter loads new code at runtime, it needs to be able to load and dispatch the new opcode. That's why within the current DO_OP loop, you dereference the current function table list from the interpreter structure. > > > Second, the switch will undoubtedly grow too large > > to be efficient. A full switch with as few as 512 branches already > > shows signs of performance degradation. [1] Thirdly, any function calls > > that you then *do* have to make, come at the expense of the switching > > branch, on top of normal function call overhead. > > How many asm opcodes do you imagine we'll have? I'm hardly expert here, > but I have a hard time believing we're going to have anything like 512. > Much of the remaining flexibility would be done in the vtable code. > Unless of course, _every_ new vtable function has to be re-written as > a single opcode... oh dear ... but now I'm getting confused. No. Each PMC opcode will in turn call a vtable op, which invariably will call one of our core ops. How many will there be? Perl 5 currently has over 300, and that doesn't include *anything* regex related (besides m// and s///), which Larry has already deemed will be handled as ops. I estimate that the core interpreter will provide over 800 individual opcodes. (Remember, everything is an opcode, from variable creation to scope deletion.) We can mitigate that by segmenting the opcodes into families. For instance, let's say the most efficent dispatch is via a switch no longer than 256 branches. Since regular expression ops are most likely going to occur mostly in conjunction with other regular expression ops, then we can switch to an alternate dispatch loop where the switch is populated with regex opcode handlers. (Assuming that they're invariant. See below.). Less frequently called opcodes, or opcodes that are already timely, can be relegated forever to function table lookup. > > As I understand it, all the function calls get inlined into the loop, > so what ones are going to be called inside it? Opcodes that are invariant, frequent, and quick. Invariant - Two integers add together only one way. I/O input may be done any number of ways. Frequent - basic math ops occur often. Exit does not. Quick - bit-wise manipulaton of integers should be quick.. Sleep doesn't need to be. The first of each are good candidates for inlining into the switch. The second are not. > > Sorry if I'm being dim and missing issues, I just wanted to get > this thing done, and I followed instructions. Perhaps needs/designs have > changed since instructions were given. Perhaps they have. The above is the last design direction that I'm aware of. > > Um, the latter's include file looks an awful lot like the switch > my patch generates. How does it work better? Not trying to be > antagonistic, I just don't see how it is any different/faster. The switch itself, yes. That's why I didn't point you to that. ;-) You solution should be as fast as the last set that I benchmarked. (Faster even, because the last tweak I made isn't in there, and I am doing the extra comparison.) But you will need to eventually address adding and overriding opcodes, which I don't believe your approach does well. (Which does make the difference.) -- Bryan C. Warnock [EMAIL PROTECTED]
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Oct 07, "Bryan C. Warnock" <[EMAIL PROTECTED]> took up a keyboard and banged out > On Sunday 07 October 2001 10:51 pm, Michael Fischer wrote: > > > > Questions, comments, criticisms? > > It looks like you're implementing this as a full, solitary switch. > > while (*pc) { > switch (*pc) { > } > } Right, as per Simon and Dan's original suggestions to me. > > I don't see (on simple inspection) a default case, which implies that all > functions would be in the switch. There's two problems with that. First, > you can't then swap out (or add) opcode functions, which compilation units > need to do. They're all fixed, unless you handle opcode differentiation > within each case. (And there some other problems associated with that, too, > but they're secondary.) Um, yeah, I left out the issue of default, as I had no idea what to do with it ( fprintf(stderr, "Hey, non-opcode!"); exit( EXIT_FAILURE ); )??? Anyway, suggestions from the list welcome on that point. It is generated as a pass over basic_opcodes.ops, just as the function definitions in basic_opcodes.c are done. I figured the philosophy was "Don't hack the output, hack the template" if you want to change the generated file. What swapping of opcodes do you imagine? I wouldn't have thought that user code gets to muck with the asm offerings available. > Second, the switch will undoubtedly grow too large > to be efficient. A full switch with as few as 512 branches already shows > signs of performance degradation. [1] Thirdly, any function calls that you > then *do* have to make, come at the expense of the switching branch, on top > of normal function call overhead. How many asm opcodes do you imagine we'll have? I'm hardly expert here, but I have a hard time believing we're going to have anything like 512. Much of the remaining flexibility would be done in the vtable code. Unless of course, _every_ new vtable function has to be re-written as a single opcode... oh dear ... but now I'm getting confused. As I understand it, all the function calls get inlined into the loop, so what ones are going to be called inside it? Sorry if I'm being dim and missing issues, I just wanted to get this thing done, and I followed instructions. Perhaps needs/designs have changed since instructions were given. > > [1] Pre-Parrot testing at > http://members.home.net/bcwarno/Perl6/spool/opcode_test_results.txt > > [2] The code for my last set of benchmarks at > http://members.home.net/bcwarno/Perl6/spool/interpreter.c Um, the latter's include file looks an awful lot like the switch my patch generates. How does it work better? Not trying to be antagonistic, I just don't see how it is any different/faster. Clues welcomed. Cheers. Michael -- Michael Fischer 7.5 million years to run [EMAIL PROTECTED]printf "%d", 0x2a; -- deep thought
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
On Sunday 07 October 2001 10:51 pm, Michael Fischer wrote: > > Questions, comments, criticisms? It looks like you're implementing this as a full, solitary switch. while (*pc) { switch (*pc) { } } I don't see (on simple inspection) a default case, which implies that all functions would be in the switch. There's two problems with that. First, you can't then swap out (or add) opcode functions, which compilation units need to do. They're all fixed, unless you handle opcode differentiation within each case. (And there some other problems associated with that, too, but they're secondary.) Second, the switch will undoubtedly grow too large to be efficient. A full switch with as few as 512 branches already shows signs of performance degradation. [1] Thirdly, any function calls that you then *do* have to make, come at the expense of the switching branch, on top of normal function call overhead. I've found [2] that the fastest solution (on the platforms I've tested) are within the family: while (*pc) { if (*pc > CORE_OPCODE_NUMBER) { pc = func_table[*pc](); } else { switch (*pc) { } } That keeps the switch branching small. Do this: while (*pc) { switch (*pc) { case : ... default: pc = func_table[*pc](); } } seems simpler, but introduces some potential page (or at least i-cache(?)) thrashing, as you've got to do a significant jump just in order to jump again. The opcode comparison, followed by a small jump, behaves much nicer. [1] Pre-Parrot testing at http://members.home.net/bcwarno/Perl6/spool/opcode_test_results.txt [2] The code for my last set of benchmarks at http://members.home.net/bcwarno/Perl6/spool/interpreter.c -- Bryan C. Warnock [EMAIL PROTECTED]
[PATCH] Big patch to have DO_OP as optional switch() statment
Ok, this is a big one. Over 500 line diff, plus a new module for parrot/Parrot. This patch adds a prompt to Configure.pl to allow the choice of DO_OP() as a function dereference (current behavior) or as a C switch() statement. Current up until Gregor applied 2 of Bryan's patches this evening. I hope they don't mess it up, as I haven't the energy tonight to do more. :-( This patchset gave the following, when run with options for function dereference ( the current mode ), and the switch() [ SuSE linux x86 750Mhz Athlon, 256MB RAM ] Normal: Iterations: 1 Start time: 1002505053 End time: 1002505082 Count: 1 Elapsed time:29 Estimated ops:3 Estimated ops (numerically):3.00 Elapsed time:29 Elapsed time:29.00 Ops/sec:10344827.586207 Switched: Iterations: 1 Start time: 1002507616 End time: 1002507639 Count: 1 Elapsed time:23 Estimated ops:3 Estimated ops (numerically):3.00 Elapsed time:23 Elapsed time:23.00 Ops/sec:13043478.260870 and courtesy of Mr. Schwern (debian on PowerPC, I believe) Normal: Iterations: 1 Start time: 1002507775 End time: 1002507838 Count: 1 Elapsed time:63 Estimated ops:3 Estimated ops (numerically):3.00 Elapsed time:63 Elapsed time:63.00 Ops/sec:4761904.761905 Switched: Iterations: 1 Start time: 1002508550 End time: 1002508596 Count: 1 Elapsed time:46 Estimated ops:3 Estimated ops (numerically):3.00 Elapsed time:46 Elapsed time:46.00 Ops/sec:6521739.130435 -- Questions, comments, criticisms? Cheers. -- Michael Fischer 7.5 million years to run [EMAIL PROTECTED]printf "%d", 0x2a; -- deep thought diff -ru parrot/Configure.pl parrot-switched/Configure.pl --- parrot/Configure.pl Thu Oct 4 16:19:38 2001 +++ parrot-switched/Configure.plSun Oct 7 21:59:27 2001 @@ -85,6 +85,7 @@ perl => $^X, debugging =>$opt_debugging, rm_f => 'rm -f', + do_op_t => 'switch', ); #copy the things from --define foo=bar @@ -108,6 +109,7 @@ prompt("How big would you like integers to be?", 'iv'); prompt("And your floats?", 'nv'); prompt("What is your native opcode type?", 'opcode_t'); +prompt("Opcode dispatch by switch or function ('switch' or 'func')", 'do_op_t'); unless( $c{debugging} ) { $c{ld_debug} = ' '; diff -ru parrot/MANIFEST parrot-switched/MANIFEST --- parrot/MANIFEST Sat Oct 6 08:41:57 2001 +++ parrot-switched/MANIFESTSun Oct 7 21:59:35 2001 @@ -14,6 +14,7 @@ Parrot/String.pm Parrot/Test.pm Parrot/Vtable.pm +Parrot/PPP.pm Test/More.pm Test/Simple.pm Test/Utils.pm diff -ru parrot/Makefile.in parrot-switched/Makefile.in --- parrot/Makefile.in Sun Oct 7 10:41:18 2001 +++ parrot-switched/Makefile.in Sun Oct 7 21:59:35 2001 @@ -18,6 +18,7 @@ PERL = ${perl} TEST_PROG = test_prog${exe} PDUMP = pdump${exe} +DO_OP_T = ${do_op_t} .c$(O): $(CC) $(CFLAGS) -o $@ -c $< @@ -31,7 +32,7 @@ $(TEST_PROG): test_main$(O) $(O_FILES) interp_guts$(O) op_info$(O) $(CC) $(CFLAGS) -o $(TEST_PROG) $(O_FILES) interp_guts$(O) op_info$(O) test_main$(O) $(C_LIBS) - + $(PDUMP): pdump$(O) packfile$(O) memory$(O) global_setup$(O) string$(O) strnative$(O) $(CC) $(CFLAGS) -o $(PDUMP) pdump$(O) packfile$(O) memory$(O) global_setup$(O) string$(O) strnative$(O) $(C_LIBS) @@ -44,7 +45,7 @@ strnative$(O): $(H_FILES) $(INC)/interp_guts.h interp_guts.c $(INC)/op_info.h op_info.c: opcode_table build_interp_starter.pl - $(PERL) build_interp_starter.pl + $(PERL) build_interp_starter.pl -t $(DO_OP_T) interpreter$(O): interpreter.c $(H_FILES) $(INC)/interp_guts.h @@ -59,7 +60,7 @@ basic_opcodes$(O): $(H_FILES) basic_opcodes.c basic_opcodes.c: basic_opcodes.ops process_opfunc.pl $(INC)/interp_guts.h - $(PERL) process_opfunc.pl basic_opcodes.ops + $(PERL) process_opfunc.pl > basic_opcodes.c $(INC)/op.h: opcode_table make_op_header.pl $(PERL) make_op_header.pl opcode_table > $(INC)/op.h Only in parrot-switched/Parrot: PPP.pm diff -ru parrot/basic_opcodes.ops parrot-switched/basic_opcodes.ops --- parrot/basic_opcodes.opsSun Oct 7 11:27:42 2001 +++ parrot-switched/basic_opcodes.ops Sun Oct 7 21:59:35 2001 @@ -1,11 +1,3 @@ -/* basic_opcodes.c - * - * Just some basic opcodes - * - */ - -#include "parrot/parrot.h" -#include /* SET Ix, CONSTANT */ AUTO_OP set_i_ic { diff -ru parrot/build_interp_starter.pl parrot-switched/build_interp_starter.pl --- parrot/build_interp_starter.pl Sun Oct 7 20:46:15 2001 +++ parrot-switched/build_interp_starter.pl Sun Oct 7 21:59:35 2001 @@ -7,6 +7,16 @@ use strict; use Parrot::Opcode; +use Parrot::PPP; + +use Getopt::Std; + +use vars qw($opt_t); +getopts('t:'); + +die "You