On Sunday, 6 August 2006 17:20, Bob Rogers wrote:

> The problem is that inferior runloops cannot be re-entered via
> continuation. One example (out of many) is the delegate pmclass, which
> uses inferior runloops in vtable methods to run bytecode. The resulting
> call structure (mixing bytecode and C) looks like this:
>
>   main runloop
>   => main bytecode
>      => op (e.g. set_s_p)
>         => vtable method
>            => inferior runloop
>               => method bytecode
The continuation barrier is only one nastiness of inferior runloops.
The second problem is that they heavily influence the guts of garbage
collection. Due to the involved C code with its auto-storage objects on
the C stack, we have to walk the C stack for live objects. This is the
reason our GC system is termed 'conservative', but much worse, it
effectively prevents timely destruction. See "Method Overloading and GC
Issues" in Cfunc.pod.

The only way IMHO to avoid this problem is to run GC at "safe" points
at the runloop level, so that we don't have to trace any system areas
(HW registers and the C stack). This can be achieved by a) avoiding
inferior runloops and b) triggering GC within some opcodes like C<new>
if there is a resource shortage, but not inside C code. I.e. if an
allocation inside C code needs more objects, it would just set a
"need_GC" flag, allocate more objects, and proceed. This would also
have the advantage that no pointers (e.g. str->strstart) would move
under C code.

Back to continuations.

> 2. Eliminate inferior runloops. This is much harder, as it involves
> rewriting much code, though the attached POC (see patch; more on this
> below) shows that it is possible. The essential strategy is to split
> the C code into two or more portions, the first containing the part that
> sets up the bytecode call, and the rest deals with the result.

Exactly. But the code splitting can be simplified vastly IMHO. Your POC
creates an extra layer between opcodes and C code, which basically does
two things:

- manage to call user code on behalf of the C code and pass args to it:
  C<Parrot_op_set_reg_from_vtable> and C<C_continuation> stuff
- pass return results back to the opcode:
  C<store_tail_result_into_register>

The proposal below deals with these issues directly in the runloop.
Basically all C code is called like this:

  if (info->op_func(interpreter, args)) {
      /* if it branched, goto new pc */
      pc = args[n - 1];
  }

where C<op_func> is any C function following this calling convention.
(It might be reasonable to also pass an argument signature or argument
count, but these are implementation details.)

When this opcode function is overloaded, it would be a stub that

- invokes the PASM/PIR subroutine, which returns the new C<pc> and
  creates a return continuation
- sets up the current_args and current_results pointers

Then the runloop would just dispatch to the PASM/PIR user code and run
it without any inferior runloop.

There's still the mentioned problem of calling user code deeply inside
some called C function, e.g. calling C<get_string> from within
C<parrot_pass_args> due to argument conversion. This can be handled by
a combination of:

a) rolling it into the runloop, e.g. print_p => set_s_p ; print_s
b) disallowing or restricting some overloading
c) handling argument conversion in the runloop too

The POC runloop below (np2.c) basically splits the runloop into two
parts:

a) fast inlined (switched) opcodes
b) C function calls, with argument setup performed in the runloop

This argument setup could also be used for calling PASM/PIR code, so
that necessary argument conversions, which might trigger user code, are
also executed in the runloop.

The attached Cfunc.pod has plenty of other reasons why we might need to
change C internals as well. The continuation barrier is just one of
these.

Comments welcome,
leo
=head1 TITLE

Calling C Functions

=head1 STATUS

Proposal, supersedes parts of F<interfaces.pod>. This document does not
describe HLL issues of interfaces, but rather the internals that might
be needed to implement them.

=head1 AUTHOR

Leopold Toetsch

=head1 ABSTRACT

Parrot calls C functions on behalf of user code in a wide variety of
ways: opcode and vtable functions, (multi-)subs and -methods. All these
functions are called with differing calling schemes, which causes a lot
of issues.

This document mainly covers opcodes. The description of the current
state also shows some problems with vtables and methods, but a
generalization to vtables/methods and interfaces still needs more
specification from the namespace front.

This proposal tries to describe the current state and a possible
solution.

=head1 DESCRIPTION of STATE

Classification of function types.

=head2 Opcode Functions

Parrot currently has more than 1200 opcodes. There are basically two
types of opcodes. First: simple, easily JITtable [1] (mostly
number-related) opcodes, which have a representation at the hardware
level. Second: opcodes that a) just call another function, b) are too
complex to deserve an opcode, or c) would better be methods.

=head3 a) just calling another function

E.g.

  op find_cclass(out INT, in INT, in STR, in INT, in INT)

calls the API function C<Parrot_string_find_cclass> with the very same
arguments and in the same order. The C<out> argument of the opcode is
the return value of the called function.

=head3 b) too complex

E.g.

  op gcd(out INT, out INT, out INT, in INT, in INT) :advanced_math

Please have a look at the generated code in the function runcore at
F<src/ops/core_ops.c:11856 - 12114> (or any other runcore). (Line
numbers may vary a bit due to code changes.)

=head3 c) better a method

A lot of the C<String> or C<IO> opcodes should rather be methods. E.g.
  op split(out PMC, in STR, in STR) :base_core

=head2 Sidenote: opcode permutations

A major problem with the current scheme is opcode permutations. Due to
variable and constant arguments, we get C<2^n> opcodes instead of just
one, where C<n> is the number of incoming args that allow both
variables and constants. E.g. the user-visible opcode (from
F<src/ops/math.ops>)

  sub(out INT, in INT, in INT)

actually is one of:

  sub_i_i_i    # sub I0, I1, I2
  sub_i_i_ic   # sub I0, I1, 100
  sub_i_ic_i   # sub I0, 100, I2
  sub_i_ic_ic  # sub I0, 100, 10  -> set I0, 90

The last one doesn't exist in reality; it is precalculated at compile
time by evaluating C<sub_i_i_i> with the constants stuffed into
registers.

There are B<16> incarnations of the mentioned C<find_cclass> opcode
B<per run core>. See e.g. F<src/ops/core_ops.c:10953 - 11079>. The
switched, CGoto, and CGP runcores also have implementations of this
I<one> opcode, so that we finally have B<64> variants that all just
call the same function.

This is all well hidden behind (Perl) magic, if you don't look at the
generated code. The C<*.ops> files are easy to maintain, but we should
also be aware of the generated code.

The JIT runtime doesn't and won't [2] implement such opcodes; it just
calls the normal opcode function inside F<src/ops/core_ops.c>, which
adds another call level to reach the target function.

=head2 Sidenote 2: Opcode permutations and predereferencing/JIT

There are two implementation choices:

=over 4

=item a) thread-independent code

The code is reusable by multiple threads in the interpreter, but
therefore cannot be dereferenced fully. A constant is dereferenced to
the memory address of the constant, but a register is represented as an
offset from the (per-(thread)interpreter) register base pointer. This
produces slower code and suffers from opcode permutations.

=item b) fully dereferenced

Registers are also dereferenced to addresses, which collapses the code
for constants and registers to just one case.
The opcode permutation is gone and the runops core size is just 1/4th,
but each thread needs its own copy of the predereferenced or compiled
opcode stream(s).

=back

Currently a) is implemented. JIT code generation could use either
option at runtime with some adjustments.

=head2 Opcode function complexity and slowness

An opcode function is basically a multisub, which gets dispatched by
the runcore depending on (type, type_const) per opcode argument that
allows constants. The argument is then dereferenced according to its
type. But this dereferencing is done in each opcode, and while it is
specialized to the type, it is still fairly complex. Dereferencing an
C<INTVAL> takes 3 CPU instructions on x86; on PPC (or other CPUs that
lack (base,index,scale) memory addressing) it's 4 CPU instructions.
This of course needs additional registers to hold intermediate results,
which causes extra pain on the register-starved x86 architecture. Five
more instructions are needed to preserve and restore non-volatile
registers on the hardware stack.
Here is a disassembly of the opcode function C<sub_i_i_i> (optimized
compile, of course):

  (gdb) disas Parrot_sub_i_i_i
  Dump of assembler code for function Parrot_sub_i_i_i:
  0x080c9860 <Parrot_sub_i_i_i+0>:   push   %ebp
  0x080c9861 <Parrot_sub_i_i_i+1>:   mov    %esp,%ebp
  0x080c9863 <Parrot_sub_i_i_i+3>:   sub    $0x8,%esp
  0x080c9866 <Parrot_sub_i_i_i+6>:   mov    %esi,(%esp)
  0x080c9869 <Parrot_sub_i_i_i+9>:   mov    %edi,0x4(%esp)
  0x080c986d <Parrot_sub_i_i_i+13>:  mov    0x8(%ebp),%eax
  0x080c9870 <Parrot_sub_i_i_i+16>:  mov    0xc(%ebp),%edx
  0x080c9873 <Parrot_sub_i_i_i+19>:  mov    0xc(%eax),%esi
  0x080c9876 <Parrot_sub_i_i_i+22>:  mov    0x4(%edx),%ecx
  0x080c9879 <Parrot_sub_i_i_i+25>:  mov    0x8(%eax),%edx
  0x080c987c <Parrot_sub_i_i_i+28>:  mov    0x4(%eax),%edi
  0x080c987f <Parrot_sub_i_i_i+31>:  add    $0x10,%eax
  0x080c9882 <Parrot_sub_i_i_i+34>:  mov    (%ecx,%edx,4),%edx
  0x080c9885 <Parrot_sub_i_i_i+37>:  sub    (%ecx,%esi,4),%edx
  0x080c9888 <Parrot_sub_i_i_i+40>:  mov    %edx,(%ecx,%edi,4)
  0x080c988b <Parrot_sub_i_i_i+43>:  mov    (%esp),%esi
  0x080c988e <Parrot_sub_i_i_i+46>:  mov    0x4(%esp),%edi
  0x080c9892 <Parrot_sub_i_i_i+50>:  mov    %ebp,%esp
  0x080c9894 <Parrot_sub_i_i_i+52>:  pop    %ebp
  0x080c9895 <Parrot_sub_i_i_i+53>:  ret

We have:

=over 4

=item * call ABI including register frame setup/teardown: 0, 1, 50-53

=item * register preservation and restore: 3-9, 43, 46

=item * argument dereferencing: 13-28, 34-40

=item * setting the next pc: 31

=item * the actual work: 37 (and even that only partly; the rest of
the instruction is dereferencing)

=back

As such opcode functions are also called by the JIT core for non-JITted
or non-JITtable opcodes, the usually fastest runcore also pays a very
big penalty for these opcode functions.

=head2 Opcode Roundup

I have long been in favor of keeping only simple opcodes. JIT/x86
currently has around 200 implemented opcodes, which is about the amount
that is JITtable. OTOH I'd like to have more opcodes to e.g. access
int8, int16, int32, int64, float, and vectors of these.
=head2 It sometimes doesn't even compile

  > One more note: be sure to compile Parrot optimized

  I can't. My dev machine's running gcc 2.95.4, and gcc throws lisp
  error messages compiling the switch core if I turn on optimizations.
                                                               -- Dan

As said, we currently have more than 1200 opcode variants. Continuing
the "everything is an opcode" approach (but see below) could end with
opcode counts and runloop functions that are far beyond C compiler
optimization or even compilation limits.

=head2 A Remark WRT Loadable Opcode Libraries

A proposed scheme to circumvent the mentioned compiler troubles is to
load parts of the opcodes as shared libraries. This would of course
reduce the core opcode count, but doesn't change the overall problem:
these loaded opcodes would still exhibit the permutation cost of
operand addressing and - depending on the implementation - some of the
code multiplication by providing differing run core versions of the
opcodes.

Only the slowest run core (function based) can call these opcode
functions with no speed loss; all other runcores need some extra code
or would just call the loaded functions anyway, which renders the
achievable opcode speed 'optimization' to nothing. This means that we
can call a plain runtime library function at the same speed that a
loaded opcode function would provide, just without the overhead of code
permutation inherently present with current opcodes.

=head2 VTABLE Functions

Vtable functions implement the object and class behaviour of PMCs.
Calling a vtable function is quite fast. I see several disadvantages,
though.

=head3 Vtable size

A I<vtable> provides all the builtin interface methods of all PMCs.
When you look at F<src/pmc/integer.c> you see a lot of vtable methods,
like C<Parrot_default_get_pmc_keyed>, which just call the default
implementation, which then throws an exception. All PMCs have all
vtable slots allocated, independently of any usage for them.
This takes a lot of memory, around 1 KByte (32-bit CPU) per class.
There are a lot of utility PMCs like C<ParrotIO> that implement almost
no functionality via the C<vtable>. The whole vtable is basically
unused memory and just increases Parrot's memory footprint.

=head3 Extensibility

As the C<vtable> structure with all its slots is strictly defined at
compile time, you can't just extend the vtable for one class or PMC to
get more functionality; all PMCs and all classes would get that new
slot. This means that we can't create yet unknown interfaces at
runtime, if they are based on vtables.

=head3 Inheritance

C<vtable> method inheritance is done at compile time by the PMC
compiler F<tools/build/pmc2c.pl>. Therefore you can't override any of
these C<vtable> methods at runtime. You can create a new derived class,
though, which inherits from a PMC, e.g.

  cl = subclass 'Integer', 'MyInt'

and then override e.g. the C<MyInt.abs()> method:

  .namespace ['MyInt']
  .sub __absolute :method
      debug('MyInt.abs() called')
      ...
  .end

But this works only statically for already existing methods during
creation of the C<MyInt> class. Multiple inheritance, as needed e.g.
for the C<OrderedHash> PMC, doesn't work at all.

=head3 Introspection

Builtin vtables aren't visible with the C<can> opcode:

  .local pmc Int, MyInt          # a .Integer and a 'MyInt' PMC
  Int = new .Integer
  MyInt = new 'MyInt'
  $I0 = can MyInt, '__absolute'  # 1
  $I1 = can Int, '__absolute'    # 0

and you can't use method call syntax at all for vtables:

  $I1 = Int.'__absolute'()       # Method '__absolute' not found

There is also no metadata about vtable function arguments or the
return value type (if any).

=head3 Opcode and vtable permutations

Vtable functions partly suffer from the same argument permutation
problem as opcodes. From a HLL point of view a vtable function is part
of some interface. E.g. the C<FixedPMCArray> PMC I<does> the I<array>
interface.
One part of that interface is fetching an item from the array:

  elem = ar[idx]

When we now have a look at vtable.tbl, we get:

  $ grep get_\.\*keyed vtable.tbl | wc -l
  15

a list of 15 C<vtable> functions that together build one part of the
C<array> interface, namely 'fetch' an element. It's of course a nice
optimization to be able to use native integers for the index, or
convenient to directly fetch a C<STRING*> element out of the PMC array.
But imagine now that Perl6 wants to create a tied array with a
user-defined C<FETCH> method. Overriding just C<get_pmc_keyed> would
work if all arguments were PMCs, but would already fail for a native
integer key. That is, a C<tie> implementation would have to override
more or less all of these fetch methods.

The same is true for the C<STORE> operation, which consists of another
bunch of 15 vtable functions. Due to opcode permutation (constant vs
variable) we need 28 opcodes for each of these 2 interface parts.

=head3 Unimplemented vtable slots

There is almost no support for C<STRING*> operations within vtables.
Currently a typical opcode sequence to perform a C<STRING*> function on
a PMC is:

  set S0, P0       # e.g. a .String PMC
  length I0, S0    # get (codepoint) length of String P0
  new P1, .Integer
  set P1, I0

This of course becomes worse if more strings are involved:

  set S0, P0
  set S1, P1
  substr S2, S0, 1, 2, S1
  new P2, .String
  set P2, S2
  set P0, S0

That is six opcode dispatches for a plain C<String.replace()>
functionality instead of just one.

Also a lot of PMC I<vtable> functions are missing, e.g. a plain:

  q = p.get_numeric()   # return a number PMC: $q = +$p;

Only C<get_integer> and C<get_string> are in the vtable, which return
native C<I> or C<S> respectively. For completeness, we'd still have to
extend the vtable by a fair number of slots, which only adds to the
mentioned size and permutation problems.
=head2 Builtin Methods

To work around the inability to extend vtables, there is another
calling scheme, implemented via the C<METHOD> syntax in F<.pmc> files.
E.g. in F<src/pmc/string.pmc>:

  METHOD PMC* lower() { ... }

and we can use this like any other method from PASM/PIR:

  .local pmc s0, s1
  ...
  s1 = s0."lower"()

The drawback is that calling these methods works totally differently,
as these methods are created as NCI PMCs. There is no symmetry between
vtable methods like C<Integer.__absolute()> and "real" methods like
C<String.lower()>.

=head2 Builtin Multi-Subs

All the infix operations call (possibly builtin) multi subroutines. The
infix operations are usable as an opcode:

  add P0, P1, P2

which actually gets translated to

  infix .MMD_ADD, P0, P1, P2

They can also be used in a function call:

  "__add"(l, r, d)

or as a method:

  d = l."__add"(r)

Multi subs are the most flexible way to implement a given
functionality, as a dedicated function is called that already deals
with the correct argument types.

=head2 Method Overloading and GC Issues

Methods and vtables can be overridden with PASM/PIR code. This
currently needs the invocation of an extra run loop. We then have a C
stack layout like e.g. this (assuming the stack grows downwards):

  switch_core()               # e.g. when running -S
  ParrotObject_get_integer()  # in ParrotObject 2)
  run_meth_fromc_args_reti()  # 2)
  runops_args()               # src/inter_run.c
  runops()
  runops_int()
  switch_core()               # next run loop

  2) Parrot_ prefix omitted, but you get the picture.

With this C stack layout, we basically have two problems:

=head3 The C stack isn't traced, if GC is called from the runloop

A C<sweep> opcode issued from the runloop doesn't trace the C stack.
Here is the reason why:

=over

=item Diagram: stack depth vs execution time

  C stack, time A          C stack, time B
  ------------------------------------------
  run_core                 run_core
  [ Vtable function ]      [ some function    ]
  [ stack alignment ]      [ another function ]
  local PMC variable A     [ stack alignment  ]
  local PMC variable B

  ---> time

=item Stack alignment

The ABIs of C<x86> or C<ppc> define a stack alignment of 16 bytes. To
keep that alignment the stack pointer is just decremented by some
amount if needed. This can create 'holes' in the C stack, which aren't
filled by any local variable. The memory just keeps its old contents.

=item Timely destruction and conservative GC

Parrot has a conservative GC, which basically means: every memory
location that looks like a pointer to an object keeps this object
alive. If there is an INTVAL that holds the address of a PMC, that PMC
is marked as alive. When such a PMC needs timely destruction, the GC
would find it alive, even though no real reference to that PMC exists
anymore, and timely destruction doesn't work anymore.

=item All together

Due to the stack alignment there is a fair (and fully unpredictable)
chance that "old" memory looks like a PMC. In the diagram above, the
C<PMC A> is still visible on the C stack at time C<B>. This leads to
subtle bugs, which are almost impossible to track down. It took me 3
days to even grok what was going on here and some more to fully analyze
the problem (yes, this bug did occur, and it's the reason for not
tracing the C stack at the runloop level).

=back

=head3 Not tracing the C stack is also wrong

When there is a secondary run loop, all items further up the stack
aren't traced I<if> GC is triggered by the C<sweep> opcode. This could
lead to infant mortality (F<docs/dev/infant.pod>), unless extra care is
taken to anchor objects manually. This is of course error-prone as
hell, because you don't always know that somewhere down the stack
another runloop will be started.
=head2 MMD issues

Currently the MMD lookup is fully dynamic, but is cached in a quadratic
table per type pair. Besides the size issues (the table is already 1
MByte just for Parrot core types) there is also another problem: the
table holds pointers to C code and to PMCs. The kind of pointer is
encoded by setting the low bit of the pointer. But this hack fails
miserably if the architecture doesn't align C functions at least at
word boundaries, or uses low bits for permissions (e.g. hppa).

=head1 Function Summary

=head2 Table of C Functions and Major Characteristics

          speed  extend  inherit  introspect  call scheme
  -----------------------------------------------------------
  Opcode    +      1)      -          +        Opcode
  Vtable    +      -       static     -        Vtable
  METHOD    -      2)      +          +        NCI
  Multi     -      2)      +          +        C 3)

  1) Dynamic opcodes, but only usable as a function.
  2) Without opcode rewriting/caching
  3) or NCI if a multisub is obtained by find_name

=head2 Pros for the current state

=over 4

=item * It works (more or less) and is implemented.

=item * Some items are fixable. E.g. we could create an introspection
add-on for vtables.

=back

=head2 Cons

=over 4

=item * Code size

=item * May hit hard compiler limits by adding further opcodes
(permutations)

=item * Code complexity due to 4 differing schemes

=item * Fixing current issues needs still more code and adds complexity

=item * Big memory footprint with bad cache locality

=item * Implementing interfaces is complex when it comes to mixing
different vtables and methods to create new functionality

=back

=head1 PROPOSAL

=head2 Summary

The 4 differing call schemes are replaced by just one, named B<CSub>.
The calling conventions for a B<CSub> are standardized.

=head2 B<CSub PMC>

There is already an existing F<src/pmc/csub.pmc>, which is unused; I'm
just reusing it. Anyway, a B<CSub PMC> provides at least these fields:

  function_ptr        ... to the C function
  name                ... function name
  function_signature  ...
                          kind of params/return value

The C<CSub PMC> replaces C<NCI> for calls inside the interpreter (to
known functions).

=head2 Calling conventions: B<csub_fun>

All functions pointed to by a B<CSub> have the same function prototype:

  typedef int (*csub_fun)(Interp *, void **args);

The C<args> array is defined to hold these items:

  args[0]          ... address of the return value, if the function
                       returns something
  args[0/1..n-1/n] ... the n arguments the function gets; these are
                       either INTVALs or pointers to other types
  args[n/n+1]      ... pc of this instruction

The return value is 1 if the function changed the C<pc>, i.e. it caused
a branch, and 0 otherwise. The C<pc> is used for functions that are
called from the runloop and is usually just ignored. But it also allows
running arbitrary PASM/PIR code by a) creating a new register context,
b) creating a return continuation, c) updating the C<pc> to point to
the user code (a-c is exactly what C<Sub.invoke> does), and d) setting
up the C<get_results> and C<set_args> pointers. This means that any
CSub called from the runloop can easily call a PASM/PIR implementation
without starting an extra run loop.

=head2 Opcode functions become C<csub_fun>s

Here is the same C<sub> opcode as a C<csub_fun>:

  (gdb) disas F_sub_i_i_i
  Dump of assembler code for function F_sub_i_i_i:
  0x080484d0 <F_sub_i_i_i+0>:   push   %ebp
  0x080484d1 <F_sub_i_i_i+1>:   mov    %esp,%ebp
  0x080484d3 <F_sub_i_i_i+3>:   mov    0xc(%ebp),%eax
  0x080484d6 <F_sub_i_i_i+6>:   mov    0x4(%eax),%edx
  0x080484d9 <F_sub_i_i_i+9>:   mov    (%eax),%ecx
  0x080484db <F_sub_i_i_i+11>:  sub    0x8(%eax),%edx
  0x080484de <F_sub_i_i_i+14>:  xor    %eax,%eax
  0x080484e0 <F_sub_i_i_i+16>:  mov    %edx,(%ecx)
  0x080484e2 <F_sub_i_i_i+18>:  pop    %ebp
  0x080484e3 <F_sub_i_i_i+19>:  ret

It's obvious that getting the args is now as efficient as getting
arguments from the stack. No extra non-volatile registers are used (and
thus none have to be preserved). The opcode count is halved, and code
size is reduced even more due to simpler (and faster) opcodes.
Actually the above assembler code is also (almost [3]) the same as the
code generated by C<src/pic_jit.c>, i.e. a PASM/PIR subroutine compiled
to hardware assembler code.

This one function is now also the same for C<sub_i_i_ic> and
C<sub_i_ic_i> and can therefore be emitted just once. This reduces the
code size of C<core_ops.c> to ~1/4th of its current size.

=head3 The runcore dereferences arguments

The reduced code size in the opcode function of course comes at some
price. Dereferencing the arguments per opcode is specialized per
argument type. Doing it just once in the runloop needs more general
code that dereferences the arguments in a loop for all the opcodes and
fills the args array. That is, we have a switch statement over argument
types, very similar to argument passing to a PASM subroutine. This is
slower in the general case, of course.

I've timed C<examples/benchmarks/mops_intval.pasm>, which measures pure
opcode dispatch speed:

  ./parrot -f    3.25 s
  ./np2 mops    10.4  s

(np2.c is a proof-of-concept program implementing the major ideas of
opcode dispatch). But there is also:

  ./np2 mops     1.45 s

Now what: is it 3 times slower or faster? The answer can be: both. The
first and slow variant uses the general code all the time. The second
timing comes from partially inlining the most-used opcodes into a
switch statement and not calling any function at all. That is, the
function runcore becomes a partially switched one, but the now smaller
and faster opcode functions are still there when the JIT core needs
them.

The C<mops> benchmark is of course just measuring pure opcode dispatch
with very low-overhead function bodies. More complex functionality in
opcodes is by far not as subject to dispatch speed. And a final remark:
the JIT core is more than 300 times faster than C<parrot -f> for the
I<MOPS> benchmark.

Issues from above:

=head3 a) just calling another function

As the call signature of the opcode and the called function are the
same, we can just call the function directly.
"Everything is an opcode" (Dan S.) gets a revival, yeah. But in a much
more general and useful sense: we get rid of the duplication due to
argument permutation and additionally avoid an extra function call.

=head3 b) too complex

We can annotate opcode functions with an extra attribute stating how
they'll be implemented. There is already C<inline op> (which is
currently just ignored). It would mean that a) all runcores implement
it and b) the function body is also inlined in the switch statement of
the function core. The opposite of that would be C<function op>, which
means that the one and only incarnation of this opcode is in
F<src/ops/core_ops.c> (or any other file) and other runcores just call
this function. Opcodes with a plain C<op> have one function body in
each runcore.

"Opcode" for a C<function op> means: the C<op_info_t> structure
provides the metadata for function arguments and a possible return
value, and there is one and only one C<csub_fun> that implements the
function body.

=head3 Prederefed runcores, loadable opcodes

As opcode functions are much more lightweight with this calling scheme,
we can call opcode functions from the function runcore with less
penalty. If we keep interpreter-independent code, we can annotate
opcodes with e.g. C<function op> and always call the function
implementation instead of duplicating and permuting code in the
switched, direct-threaded, and JIT cores. The latter does exactly that
anyway, if the architecture doesn't provide a JITted version of the
opcode.

=head2 Vtable functions and METHODs become C<CSub>s

Most of the vtable variants and all the NCI methods would be replaced
by B<CSub>s. The vtable would keep some basic information about the PMC
like C<name> or C<base_type>. The C<CSub>s are kept in the method or
namespace hash of the PMC (basically an C<OrderedHash>), which permits
indexed access too. E.g.
  PMC *csub = pmc->vtable->meth_hash_bucket_ptr[idx].value;
  branched  = ((csub_fun)PMC_struct_val(csub))(interp, args);

Or:

  branched = pmc->vtable->csubs[idx](interp, args);

for functions kept directly in the vtable structure.

=head2 Summary

=head3 Proposal Pros

=over 4

=item * The unification of the different call schemes of C functions,
and their usage as an opcode replacement, would vastly simplify
Parrot's internals.

=item * The unification of calling schemes generalizes the concept of
an opcode and unifies opcodes with other API functions inside the
interpreter.

=item * Moving the argument setup into the runloop vastly reduces the
permutation count of current opcodes. The B<64> incarnations of
C<find_cclass> boil down to just one, which actually is the called
function itself.

=item * Having just one calling scheme is much more friendly to further
optimizations like creating JITted variants.

=item * The mentioned GC problem due to secondary runloops can be
avoided by allowing all (PMC) opcodes to branch to another Sub.

=item * The unification of vtable functions and methods will simplify
the implementation of interfaces.

=item * Less overhead for method calls (NCI is much more general).

=back

=head3 Cons

=over 4

=item * Calling a B<CSub> as an opcode needs an extra loop for argument
setup.

=item * Calling a B<CSub> from Parrot C code is slightly less efficient
than calling a plain C function.

=item * It has to be implemented.

=back

=head1 TODO

Most of the todos aren't directly related to this proposal but need
fixing/speccing anyway.

=over 4

=item * Specify method calls to B<CSub>s

=item * Static (class) methods vs functions

=item * Namespace issues, e.g.

  open P0, 'f'    vs    P0 = "open"('f')

=item * Call signatures for multi-subs and -methods

=back

=head1 IMPLEMENTATION STEPS

=head2 Proof of concept

The example program C<np2.c> (see also F<docs/dev/nanoparrot.c>)
implements the opcode dispatch and a limited set of argument setup for
integer and some other signatures.
It additionally uses a C<short> (16-bit) type for opcodes and is
prepared for loading long integer constants from the constant table.

=head2 Adjust code generation chain

=over 4

=item * Annotate opcodes with C<inline> or C<function>

=item * Call C<function op>s from other runcores

=back

=head2 Interfaces (still needs more specs)

=over 4

=item * Split the vtable into vtable and CSub parts

=item * Patch the PMC and opcode compilers to create CSubs

=back

=head2 Cleanup and removal of old code

=head1 FOOTNOTES

=over 4

=item [1] JIT

The term C<JIT> in Parrot is technically misleading when it comes to
the sense 'just in time (compilation)'. Parrot predereferences code
just in time, when the instruction is first executed. This is actually
a just-in-time recompilation to a faster opcode, especially when a
polymorphic inline cache (PIC) caches e.g. a hash lookup. Compilation
to machine code is still done per file (with the C<-j> switch) or now
(very limited) per subroutine (with C<-Sj> or C<-Cj>). Nevertheless
I'll use C<JIT> here in the sense of compiling to machine code.

=item [2] JIT opcode coverage

It takes about 0.2 - 0.5 hours to properly implement and test just one
JITted opcode for one architecture (if all goes well and you don't have
to resort to debugging). See e.g. the svn commit log from Feb 16th,
r11554 - r11562, where I implemented about 20 ops for the PPC
architecture. Already existing templates vastly simplified that job.

Please open F<src/jit_cpu.c> in C<$EDITOR> and locate the line:

  Parrot_jit_fn_info_t op_jit

All entries with C<1> in the C<extcall> field are not implemented
inside the JIT runtime and have to call a function inside
F<src/ops/core_ops.c>.

=item [3] PIC/JIT

Below is the disassembly of the C<sub> function from:

  .sub main
      ...
      $I0 = 'sub'(i, j)
      ...
  .end

  .sub 'sub'
      .param int a
      .param int b
      $I0 = a - b
      .return ($I0)
  .end

created by:

  $ gdb parrot
  (gdb) b parrot_build_asm
  (gdb) r -Cj sub.pir
  (gdb) fin
  (gdb) x /13i $1->arena.start
  0x84b7d58:  push   %ebp
  0x84b7d59:  mov    %esp,%ebp
  0x84b7d5b:  mov    0x10(%ebp),%eax
  0x84b7d5e:  mov    0x4(%eax),%edx
  0x84b7d61:  mov    0x8(%eax),%ecx
  0x84b7d64:  sub    %ecx,%edx
  0x84b7d66:  mov    0x10(%ebp),%eax
  0x84b7d69:  mov    (%eax),%eax
  0x84b7d6b:  mov    %edx,(%eax)
  0x84b7d6d:  xor    %eax,%eax
  0x84b7d6f:  mov    %ebp,%esp
  0x84b7d71:  pop    %ebp
  0x84b7d72:  ret

The second fetch of C<args> isn't needed, but besides that it's the
same code as I<gcc> creates for a C<csub_fun>. As PIC/JIT uses the same
calling conventions too, this opens a wide area of future improvement
possibilities. A loadable opcode library could e.g. have the actual
opcode bodies implemented as PIR subroutines that get compiled to
native assembler code. We'd just need more types (char, float, vector,
...) in PASM to be able to express the necessary transitions to
hardware assembler code.

=back

=head1 SEE ALSO

Please make sure to have a look at the generated C<core*.c> files,
F<src/ops/core_ops.c>, F<src/ops/core_ops_switch.c> et al. This needs a
built parrot, of course. F<src/jit_cpu.c> and F<src/exec_cpu.c> are two
more generated files, for JIT and EXEC.

The demo code is in F<np2.c>.

=cut

vim: expandtab sw=2 tw=70:
/*
 * - demonstrates how the interpreter interprets csub_fun opcodes
 *
 * - compile with:
 *
 *   cc -o np2 -Wall np2.c -g -O2 && time ./np2 mops
 */

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

typedef int INTVAL;

/*
 * while not strictly necessary, the program also shows how bytecode
 * size could be halved (on a 32-bit arch) by using a short opcode_t
 */
typedef short opcode_t;
typedef double FLOATVAL;
typedef void PMC;
typedef void STRING;

struct Interp_t;
typedef int (*csub_fun)(struct Interp_t*, void**);

/*
 * simplified register structure - it provides the same amount
 * of indirections to access a register, though
 */
#define NUM_REGISTERS 32
struct Reg {
    INTVAL    int_reg;
    FLOATVAL  num_reg;
    STRING   *string_reg;
    PMC      *pmc_reg;
};

#define REG_INT(x) interpreter->bp[x].int_reg

/* the constant table is simplified */
#define PCONST(x)  interpreter->code->const_table[x]

struct pf {
    opcode_t  *byte_code;
    void     **const_table;
};

/*
 * op_info_t args, which is arg_type_t but includes also
 * a branch bit and the in/out bit
 */
typedef enum {
    ARG_I,      /* 0b000000 - plain args */
    ARG_S,      /* 0b000001 */
    ARG_P,      /* 0b000010 */
    ARG_N,      /* 0b000011 */
    ARG_IC,     /* 0b000100 - in constants */
    ARG_SC,     /* 0b000101 */
    ARG_PC,     /* 0b000110 */
    ARG_NC,     /* 0b000111 */
    ARG_IR,     /* 0b001000 - out results */
    ARG_SR,     /* 0b001001 */
    ARG_PR,     /* 0b001010 */
    ARG_NR,     /* 0b001011 */
    ARG_BRANCH, /* 0b001100 - opcode may branch */
    ARG_ICC     /* 0b001101 - long int constants in table */
} arg_types;

struct op_info_t {
    const char *name;
    /* the distinct op_func_table of parrot - just having one
     * structure seems to be simpler */
    csub_fun op_func;
    const arg_types argt[8];
    char n_ops;
};

struct Interp_t {
    struct Reg       *bp;
    struct pf        *code;
    struct op_info_t *op_info;
    int               flags;
};
typedef struct Interp_t Interp;

/*
 * list of all opcodes
 */
#define OPCODES OP(end), OP(print_sc), OP(print_i), \
    OP(set_i_ic), OP(set_i_icc), OP(if_i_ic), OP(sub_i_i_i), \
    OP(MAX)

#define OP(x) OP_ ## x
typedef enum { OPCODES }
opcodes;
#undef OP

#define ARG_MAX 6

/*
 * argument setup - fill the **args array
 */
static int
set_args(Interp *interpreter, struct op_info_t *info, opcode_t *pc,
         void **args)
{
    int i, n, idx;

    n = info->n_ops - 1;
    args[n] = pc++;
    for (i = 0; i < n; ++i) {
        idx = pc[i];
        switch (info->argt[i]) {
            case ARG_IR:    /* result - address of register */
                args[i] = &REG_INT(idx);
                break;
            case ARG_I:     /* value - contents of register */
                args[i] = (void*)REG_INT(idx);
                break;
            case ARG_IC:    /* (small) constant - value of opcode */
                args[i] = (void*)(idx);
                break;
            /*
             * a note WRT long integer constants:
             * currently we need sizeof(opcode_t) >= sizeof(INTVAL)
             * to be able to store INTVAL constants in the bytecode
             *
             * This will become a major PITA for 64 bit archs, as:
             * - the hardware does not provide 64-bit immediate
             *   opcodes (not even amd64 - there is just one mov ins)
             * - it's causing huge bytecode size
             *
             * Therefore we might move integer constants or "big"
             * integer constants to the constant table too.
             * The PASM compiler can emit a special load instruction
             * (here set_i_icc) to fetch the constant.
             */
            case ARG_ICC:
                args[i] = PCONST(idx);
                break;
            case ARG_SC:
                args[i] = PCONST(idx);
                break;
            case ARG_BRANCH:
                args[i] = (void*)(idx);
                break;
            default:
                assert(!"illegal arg");
        }
    }
    return n + 1;
}

/*
 * function core runloop with partially inline switched opcodes
 */
static void
run(Interp *interpreter, opcode_t *pc)
{
    int n;
    void *args[ARG_MAX];
    struct op_info_t *op_info, *info;
    opcode_t op;

    op_info = interpreter->op_info;
    while (pc) {
        op = *pc;
#ifdef TRACE
        printf("PC %2d OP %d %s\n",
                pc - interpreter->code->byte_code,
                op, op_info[op].name);
#endif
        info = op_info + *pc;
        /* inline a few - I'd expect 50-100 inline ops */
        switch (op) {
            case OP_sub_i_i_i:  /* inline op */
                REG_INT(pc[1]) = REG_INT(pc[2]) - REG_INT(pc[3]);
                pc += 4;
                break;
            case OP_if_i_ic:    /* inline op */
                if (REG_INT(pc[1])) {
                    pc += (int)pc[2];
                    break;
                }
                pc += 3;
                break;
            default:            /* func-only op */
                n = set_args(interpreter, info, pc, args);
                pc += n;
                if (info->op_func(interpreter, args)) {
                    /* if it branched, goto new pc */
                    pc = args[n - 1];
                }
        }
    }
}

/*
 * nanoparrot is also implementing switched and cgoto runcores with
 * the help of these macros.
 * In reality the opcode compiler provides the necessary code
 * mangling.
 */
#  define CASE(function) \
    static int \
    F_ ## function (Interp *interpreter, void **args) {
#  define DISPATCH
#  define ENDRUN
#  define ENDDISPATCH
#  define NEXT     return 0; }
#  define BRANCHED return 1; }

/*
 * dispatch loop / opcode (function) bodies i.e.
 * the .ops files
 *
 * core.ops et al
 */

DISPATCH

CASE(end)
    args[0] = NULL;
BRANCHED

CASE(print_s)
    printf("%s", (char*)args[0]);
NEXT

CASE(print_i)
    printf("%d", (int)args[0]);
NEXT

CASE(set_i_i)
    /* there is just one set_i_ix function */
    *(int*)args[0] = (int)args[1];
NEXT

CASE(if_i_i)
    if ((int)args[0]) {
        opcode_t *pc = args[2];
        pc += (int)args[1];
        args[2] = pc;
        return 1;
    }
NEXT

CASE(sub_i_i_i)
    *(int*)args[0] = (int)args[1] - (int)args[2];
NEXT

CASE(MAX)
    printf("illegal opcode\n");
    exit(1);
NEXT

ENDDISPATCH
ENDRUN

/*
 * opcode metadata, also generated by the opcode compiler
 */
static struct op_info_t op_info_init[] = {
    { "end",       F_end,       {0},                      1 },
    { "print_sc",  F_print_s,   { ARG_SC },               2 },
    { "print_i",   F_print_i,   { ARG_I },                2 },
    { "set_i_ic",  F_set_i_i,   /* just one function set_i_i */
                                { ARG_IR, ARG_IC },       3 },
    { "set_i_icc", F_set_i_i,   /* just one function set_i_i */
                                { ARG_IR, ARG_ICC },      3 },
    { "if_i_ic",   F_if_i_i,    { ARG_I, ARG_BRANCH },    3 },
    { "sub_i_i_i", F_sub_i_i_i, { ARG_IR, ARG_I, ARG_I }, 4 }
};

static void
init(Interp *interpreter, opcode_t *prog)
{
    /*
     * create 1 register frame
     */
    interpreter->bp = calloc(NUM_REGISTERS, sizeof(struct Reg));
    /*
     * set opcode metadata
     */
    interpreter->op_info = op_info_init;
    /*
     * the "packfile"
     */
    interpreter->code = malloc(sizeof(struct pf));
    interpreter->code->byte_code = prog;
    /*
     * create a simplified constant table
     */
#define N_CONSTS 5
    interpreter->code->const_table = malloc(N_CONSTS * sizeof(char*));
    interpreter->code->const_table[0] = "\n";
    interpreter->code->const_table[1] = "done\n";
    interpreter->code->const_table[2] = "error\n";
    interpreter->code->const_table[3] = "usage: ./nanoparrot2 mops\n";
    interpreter->code->const_table[4] = (void*) 100000000;
}

int
main(int argc, char *argv[])
{
    opcode_t *prog;
    /*
     * the mops main loop
     */
    opcode_t mops[] = {
        OP_set_i_icc, 4, 4,     /*     set I4, n      */
        OP_print_i, 4,          /*     print I4       */
        OP_print_sc, 0,         /*     print "\n"     */
        OP_set_i_ic, 5, 1,      /*     set I5, 1      */
        OP_sub_i_i_i, 4, 4, 5,  /* L1: sub I4, I4, I5 */
        OP_if_i_ic, 4, -4,      /*     if I4, L1      */
        OP_print_sc, 1,         /*     print "done\n" */
        OP_end                  /*     end            */
    };
    /* another program */
    opcode_t usage[] = {
        OP_set_i_ic, 0, 2,      /*     set I0, 2                  */
        OP_if_i_ic, 0, 6,       /*     if I0, L1 # test branching */
        OP_print_sc, 2,         /*     print "error\n"            */
        OP_end,                 /*     end                        */
        OP_print_sc, 3,         /* L1: print "usage...\n"         */
        OP_end                  /*     end                        */
    };
    Interp *interpreter = malloc(sizeof(Interp));

    prog = usage;
    if (argc > 1) {
        if (!strcmp(argv[1], "mops"))
            prog = mops;
    }
    init(interpreter, prog);
    run(interpreter, prog);
    return 0;
}

/*
 * Local variables:
 *   c-indentation-style: bsd
 *   c-basic-offset: 4
 *   indent-tabs-mode: nil
 * End:
 *
 * vim: expandtab shiftwidth=4:
 */