On Sunday, 6 August 2006 17:20, Bob Rogers wrote:
>    The problem is that inferior runloops cannot be re-entered via
> continuation.  One example (out of many) is the delegate pmclass, which
> uses inferior runloops in vtable methods to run bytecode.  The resulting
> call structure (mixing bytecode and C) looks like this:
>
>     main runloop
>         => main bytecode
>             => op (e.g. set_s_p)
>                 => vtable method
>                     => inferior runloop
>                         => method bytecode

The continuation barrier is only one nastiness of inferior runloops. The 
second problem is that they heavily influence the guts of garbage 
collection. Due to the involved C code with its auto-storage objects on the 
C stack, we have to walk the C stack for active objects. This is the reason 
that our GC system is termed 'conservative', but much worse, it effectively 
prevents timely destruction.

See "Method Overloading and GC Issues" in Cfunc.pod. The only way IMHO to 
avoid this problem is to run GC at "safe" points at the runloop level, so 
that we don't have to trace any system areas (HW registers and C stack). This 
can be achieved by a) avoiding inferior runloops and b) triggering GC within 
some opcodes like C<new>, if there is resource shortage but not inside C 
code. I.e. if an allocation inside C code needs more objects, it would just 
set a flag "need_GC", allocate more objects and proceed.

This would also have the advantage that no pointers (e.g. str->strstart) 
would move under C code.
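
Minimally sketched, and with purely illustrative names (none of this is 
existing Parrot API), the allocation path would look like:

    /* inside the allocator: don't run GC, just note the shortage */
    PMC *
    allocate_pmc(Interp *interp)
    {
        if (free_list_is_empty(interp)) {
            interp->flags |= NEED_GC;    /* defer GC to a safe point */
            refill_free_list(interp);    /* allocate more objects now */
        }
        return pop_from_free_list(interp);
    }

    /* in the runloop, at a safe point, e.g. before dispatching 'new' */
    if (interp->flags & NEED_GC)
        run_gc(interp);                  /* no C stack to trace here */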

Back to continuations.

>    2.  Eliminate inferior runloops.  This is much harder, as it involves
> rewriting much code, though the attached POC (see patch; more on this
> below) shows that it is possible.  The essential strategy is to split
> the C code into two or more portions, the first containing the part that
> sets up the bytecode call, and the rest deals with the result.

Exactly. But the code splitting can be simplified vastly IMHO. Your POC 
creates an extra layer between opcodes and C code, which basically does 
two things:

- it manages to call user code on behalf of the C code and passes args to
  it: the C<Parrot_op_set_reg_from_vtable> and C<C_continuation> stuff
- it passes return results back to the opcode:
  C<store_tail_result_into_register>

The proposal below deals with these issues directly in the runloop. 
Basically all C code is called like this:

    if (info->op_func(interpreter, args)) {
        /* if it branched, goto new pc */
        pc = args[n - 1];
    }

where C<op_func> is any C function following this calling convention. (It 
might be reasonable to also pass an argument signature or argument count, but 
these are implementation details.)

Now when this opcode function is overloaded, it would be a stub that

- invokes the PASM/PIR subroutine, which returns the new C<pc>
  and creates a return continuation
- sets up the current_args and current_results pointers

The runloop would then just dispatch to the PASM/PIR user code and run it 
without any inferior runloop.
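
Sketched with the calling convention from above (C<invoke_overload> is a 
hypothetical helper standing in for what C<Sub.invoke> does), such a stub 
for an overloaded C<set_s_p> could look like:

    /* args[0] = address of the S result, args[1] = the PMC,
     * args[2] = pc of this instruction */
    static int
    overloaded_set_s_p(Interp *interp, void **args)
    {
        /* create a register context and a return continuation, set up
         * the current_args/current_results pointers, and get the pc
         * of the overloading PIR sub */
        args[2] = invoke_overload(interp, "get_string", args);
        return 1;   /* branched: the runloop now runs the method's bytecode */
    }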

There's still the mentioned problem of calling user code deep inside some 
called C function, e.g. calling C<get_string> from within C<parrot_pass_args> 
due to argument conversion. This can be handled by a combination of:

a) rolling it into the runloop, e.g.
   print_p  => set_s_p ; print_s
b) disallowing or restricting some overloading
c) handling argument conversion in the runloop too

The POC runloop below (np2.c) basically splits the runloop into two parts:
a) fast inlined (switched) opcodes
b) C function calls, with argument setup performed in the runloop

This argument setup could also be used for calling PASM/PIR code so that 
necessary argument conversions, which might trigger user code, are also 
executed in the runloop.

The attached Cfunc.pod has plenty of other reasons why we might need to 
change the C internals as well. The continuation barrier is just one of them.

Comments welcome,
leo
=head1 TITLE

Calling C Functions

=head1 STATUS

Proposal, supersedes parts of F<interfaces.pod>.

This document does not describe the high-level issues of interfaces, but
rather the internals that might be needed to implement them.

=head1 AUTHOR

Leopold Toetsch

=head1 ABSTRACT

Parrot calls C functions on behalf of user code in a vast variety of
ways: opcode and vtable functions, (multi-)subs and -methods. All
these functions are called with differing calling schemes, which
causes a lot of issues.

This document mainly covers opcodes. The description of the current
state also shows some problems with vtables and methods. But a
generalization to vtables/methods and interfaces still needs more
specification from the namespace front.

This proposal tries to describe current state and a possible solution.

=head1 DESCRIPTION of STATE

Classification of function types.

=head2 Opcode Functions

Parrot currently has more than 1200 opcodes. There are basically two
types: first, simple, easily JITtable [1] (mostly number-related)
opcodes, which have a representation at the hardware level. Second,
opcodes that a) just call another function, b) are too complex to
deserve an opcode, or c) should better be methods.

=head3 a) just calling another function

E.g.

  op find_cclass(out INT, in INT, in STR, in INT, in INT) 

calls the API function C<Parrot_string_find_cclass> with the very same
arguments in the same order. The C<out> argument of the opcode is the
return value of the called function.

=head3 b) too complex

E.g. 

  op gcd(out INT, out INT, out INT, in INT, in INT) :advanced_math

Please have a look at the generated code in the function runcore at
F<src/ops/core_ops.c:11856 - 12114> (or any other runcore). (Line
numbers may vary a bit due to code changes).

=head3 c) better a method

A lot of the C<String> or C<IO> opcodes should rather be methods. E.g.

  op split(out PMC, in STR, in STR) :base_core 

=head2 Sidenote: opcode permutations

A major problem with the current scheme is opcode permutations. Due to
variable and constant arguments, we get C<2^n> opcodes instead of just
one, where C<n> is the number of incoming args that allow variables
and constants.

E.g. the user-visible opcode (from F<src/ops/math.ops>):

  B<sub>(out INT, in INT, in INT)

actually is one of:

  sub_i_i_i            # sub I0, I1, I2 
  sub_i_i_ic           # sub I0, I1, 100
  sub_i_ic_i           # sub I0, 100, I2 
  sub_i_ic_ic          # sub I0, 100, 10 -> set I0, 90

The last one doesn't exist in reality; it is precalculated at compile
time by evaluating C<sub_i_i_i> with the constants stuffed into
registers.

There are B<16> incarnations of the mentioned C<find_cclass> opcode
B<per run core>. See e.g. F<src/ops/core_ops.c:10953 - 11079>. The
switched, CGoto, and CGP runcores also have implementations of this
I<one> opcode, so that we finally have B<64> variants that all just
call the same function. This is all well hidden behind (perl) magic
if you don't look at the generated code. The C<*.ops> files are easy
to maintain, but we should also be aware of the generated code.

The JIT runtime doesn't and won't [2] implement such opcodes; it just
calls the normal opcode function inside F<src/ops/core_ops.c>, which
adds another call level to reach the target function.

=head2 Sidenote 2: Opcode permutations and predereferencing/JIT

There are 2 implementation choices:

=over 4

=item a) thread independent code

The code is reusable by multiple threads in the interpreter, but
therefore cannot be fully dereferenced. A constant is dereferenced to
the memory address of the constant, but a register is represented as
an offset from the (per-(thread)interpreter) register base pointer. This
produces slower code and suffers from opcode permutations.

=item b) fully dereferenced

Registers are also dereferenced to addresses, which collapses the
code for constants and registers into just one case. The opcode
permutation is gone and the runops core size is just a quarter of the
current one, but each thread needs its own copy of the predereferenced
or compiled opcode stream(s).

=back

Currently a) is implemented. JIT code generation could use either
option at runtime with some adjustments.
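
As a sketch (C<pre> being one predereferenced instruction; the names
follow the POC F<np2.c> below):

  /* a) thread-independent: a register operand is kept as an offset
   *    and resolved through the per-interpreter base pointer */
  INTVAL v_a = interpreter->bp[(size_t)pre[1]].int_reg;

  /* b) fully dereferenced: the operand was already resolved to the
   *    register's address at prederef time */
  INTVAL v_b = *(INTVAL *)pre[1];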

=head2 Opcode function complexity and slowness

An opcode function is basically a multisub, which gets dispatched by
the runcore depending on (type, type_const) for each opcode argument
that allows constants. Each argument is then dereferenced according to
its type. But this dereferencing is done in each opcode, and while
it's specialized to the type, it is still fairly complex.

Dereferencing an C<INTVAL> on x86 takes 3 CPU instructions; on PPC (or
other CPUs that lack (base,index,scale) memory access) it's 4 CPU
instructions. This of course needs additional registers to hold
intermediate results, which causes extra pain on the register-starved
x86 architecture. 5 more instructions are needed to preserve and
restore non-volatile registers on the hardware stack.
Here is a disassembly of the opcode function C<sub_i_i_i> (optimized
compile of course):

  (gdb) disas Parrot_sub_i_i_i
  Dump of assembler code for function Parrot_sub_i_i_i:
  0x080c9860 <Parrot_sub_i_i_i+0>:        push   %ebp
  0x080c9861 <Parrot_sub_i_i_i+1>:        mov    %esp,%ebp
  0x080c9863 <Parrot_sub_i_i_i+3>:        sub    $0x8,%esp
  0x080c9866 <Parrot_sub_i_i_i+6>:        mov    %esi,(%esp)
  0x080c9869 <Parrot_sub_i_i_i+9>:        mov    %edi,0x4(%esp)
  0x080c986d <Parrot_sub_i_i_i+13>:       mov    0x8(%ebp),%eax
  0x080c9870 <Parrot_sub_i_i_i+16>:       mov    0xc(%ebp),%edx
  0x080c9873 <Parrot_sub_i_i_i+19>:       mov    0xc(%eax),%esi
  0x080c9876 <Parrot_sub_i_i_i+22>:       mov    0x4(%edx),%ecx
  0x080c9879 <Parrot_sub_i_i_i+25>:       mov    0x8(%eax),%edx
  0x080c987c <Parrot_sub_i_i_i+28>:       mov    0x4(%eax),%edi
  0x080c987f <Parrot_sub_i_i_i+31>:       add    $0x10,%eax
  0x080c9882 <Parrot_sub_i_i_i+34>:       mov    (%ecx,%edx,4),%edx
  0x080c9885 <Parrot_sub_i_i_i+37>:       sub    (%ecx,%esi,4),%edx
  0x080c9888 <Parrot_sub_i_i_i+40>:       mov    %edx,(%ecx,%edi,4)
  0x080c988b <Parrot_sub_i_i_i+43>:       mov    (%esp),%esi
  0x080c988e <Parrot_sub_i_i_i+46>:       mov    0x4(%esp),%edi
  0x080c9892 <Parrot_sub_i_i_i+50>:       mov    %ebp,%esp
  0x080c9894 <Parrot_sub_i_i_i+52>:       pop    %ebp
  0x080c9895 <Parrot_sub_i_i_i+53>:       ret    

We have:

=over 4

=item * call ABI including register frame setup/tear down: 0, 1, 50-53

=item * register preservation, restore: 3-9, 43, 46

=item * argument dereferencing: 13-28, 34-40 

=item * setting next pc: 31

=item * the actual work: 37 (the others are mostly dereferencing)

=back
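
For comparison, the C body of C<Parrot_sub_i_i_i>, roughly as the
opcode compiler generates it from F<src/ops/math.ops> (a sketch; the
real generated code differs in macro details):

  /* returns the address of the next opcode */
  opcode_t *
  Parrot_sub_i_i_i(opcode_t *cur_opcode, Interp *interpreter)
  {
      /* each operand is a register number that must be dereferenced
       * through the per-interpreter register base pointer */
      REG_INT(cur_opcode[1]) = REG_INT(cur_opcode[2])
                             - REG_INT(cur_opcode[3]);
      return cur_opcode + 4;    /* 'setting next pc' */
  }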
  
As such opcode functions are also called by the JIT core for
non-JITted or non-JITtable opcodes, the usually fastest runcore also
pays a very big penalty caused by these opcode functions.

=head2 Opcode Roundup

I have long been in favor of keeping only simple opcodes. JIT/x86
currently has around 200 implemented opcodes, which is about the
amount that is JITtable. OTOH I'd like to have more opcodes to e.g.
access int8, int16, int32, int64, float, and vectors of these.

=head2 It sometimes doesn't even compile

  >One more note: be sure to compile Parrot optimized

  I can't. My dev machine's running gcc 2.95.4, and gcc throws lisp
  error messages compiling the switch core if I turn on optimizations.
  -- 
                                Dan 

As said, we currently have more than 1200 opcode variants. Continuing
the everything-is-an-opcode approach (but see below) could end with
opcode counts and runloop functions that are far beyond C compiler
optimization or even compilation limits.

=head2 A Remark WRT Loadable Opcode Libraries

A proposed scheme to circumvent the mentioned compiler troubles is to
load parts of the opcodes as shared libraries. This would of course
reduce the core opcode count but doesn't change the overall problem:
these loaded opcodes would still exhibit the permutation cost of
operand addressing and, depending on the implementation, some of the
code multiplication by providing differing runcore versions of the
opcodes. Only the slowest runcore (function based) can call these
opcode functions with no speed loss; all other runcores need some
extra code or would just call the loaded functions anyway, which
renders the achievable opcode speed 'optimization' void.

This means that we can call a plain runtime library function at the
same speed that a loaded opcode function would provide, just without
the overhead of code permutation inherently present in current opcodes.

=head2 VTABLE Functions

Vtable functions implement the object and class behaviour of
PMCs. Calling a vtable function is quite fast. I see several
disadvantages though.

=head3 Vtable size

A I<vtable> provides all the builtin interface methods of all PMCs.
When you look at F<src/pmc/integer.c> you see a lot of vtable methods,
like C<Parrot_default_get_pmc_keyed>, which just call the default
implementation, which then throws an exception. All PMCs have all
vtable slots allocated, regardless of whether they are used. This
takes a lot of memory, around 1 KByte (32-bit CPU) per class.

There are a lot of utility PMCs like C<ParrotIO> that implement almost
no functionality via the C<vtable>. The whole vtable is basically
unused memory and just increases Parrot's memory footprint.

=head3 Extendibility

As the C<vtable> structure with all its slots is strictly defined at
compile time, you can't extend the vtable for just one class or PMC to
get more functionality: all PMCs and all classes would get that new
slot. This means that we can't create yet-unknown vtable-based
interfaces at runtime.

=head3 Inheritance

C<vtable> method inheritance is done at compile time by the PMC
compiler F<tools/build/pmc2c.pl>. Therefore you can't override any of
these C<vtable> methods at runtime.

You can create a new derived class, though, which inherits from a PMC,
e.g.

  cl = subclass 'Integer', 'MyInt'

and then override e.g. the C<MyInt.abs()> method:

  .namespace ['MyInt']
  .sub __absolute :method
     debug('MyInt.abs() called')
     ...
  .end

But this works only statically, for methods that already exist during
creation of the C<MyInt> class.

Multiple inheritance, as needed e.g. for the C<OrderedHash> PMC,
doesn't work at all.

=head3 Introspection

Builtin vtable methods aren't visible to the C<can> opcode.

  .local pmc Int, MyInt             # a .Integer and a 'MyInt' PMC
  Int = new .Integer
  MyInt = new 'MyInt'
  $I0 = can MyInt, '__absolute'     # 1
  $I1 = can Int, '__absolute'       # 0

and you can't use method call syntax at all for vtables:

  $I1 = Int.'__absolute'()    # Method '__absolute' not found

There is also no metadata about vtable function arguments or the
return value type (if any).

=head3 Opcode and vtable permutations

Vtable functions partly suffer from the same argument permutation
problem as opcodes. From a HLL point of view, a vtable function is
part of some interface. E.g. the C<FixedPMCArray> PMC I<does> the
I<array> interface. One part of that interface is fetching an item
from the array:

  elem = ar[idx]

When we now have a look at F<vtable.tbl>, we get:

  $ grep get_\.\*keyed vtable.tbl  | wc -l
      15

a long list of 15 C<vtable> functions that together build just one
part of the C<array> interface, namely 'fetch' an element. It is of
course a nice optimization to be able to use native integers for the
index, or convenient to directly fetch a C<STRING*> element out of the
PMC array. But imagine now that Perl6 wants to create a tied array
with a user-defined C<FETCH> method. Overriding just C<get_pmc_keyed>
would work if all arguments are PMCs, but would already fail for a
native integer key. That is, a C<tie> implementation would have to
override more or less all of these fetch methods. The same is true for
the C<STORE> operation, which consists of another bunch of 15 vtable
functions.

Due to opcode permutation (constant vs variable) we need 28 opcodes for
each of these 2 interface parts.

=head3 Unimplemented vtable slots

There is almost no support for C<STRING*> operations within vtables.
Currently a typical opcode sequence to perform a C<STRING*> function
on a PMC is:

  set S0, P0      # e.g. a .String PMC
  length I0, S0   # get (codepoint) length of String P0
  new P1, .Integer
  set P1, I0

This of course becomes worse if more strings are involved:

  set S0, P0
  set S1, P1
  substr S2, S0, 1, 2, S1
  new P2, .String
  set P2, S2
  set P0, S0

That is six opcode dispatches for plain C<String.replace()>
functionality instead of just one.

Also a lot of PMC I<vtable> functions are missing, e.g. a plain:

  q = p.get_numeric()      # return a number PMC: $q = +$p;

Only C<get_integer> and C<get_string> are in the vtable, which return
native C<I> or C<S> values respectively. For completeness, we'd still
have to extend the vtable by a fair number of slots, which only adds
to the mentioned size and permutation problems.

=head2 Builtin Methods

To work around the inability to extend vtables, there is another
calling scheme, implemented via the C<METHOD> syntax in F<.pmc> files.
E.g. in F<src/pmc/string.pmc>:

  METHOD PMC* lower() {
    ...
  }

and we can use this like any other method from PASM/PIR:

  .local pmc s0, s1
  ...
  s1 = s0."lower"()

The drawback is that calling these methods works totally differently,
as these methods are created as NCI PMCs. There is no symmetry between
vtable methods like C<Integer.__absolute()> and "real" methods like
C<String.lower()>.

=head2 Builtin Multi-Subs

All the infix operations call (possibly builtin) multi
subroutines. The infix operations are usable as an opcode:

  add P0, P1, P2

which actually gets translated to

  infix .MMD_ADD, P0, P1, P2

It can be used in a function call:

  "__add"(l, r, d)

or as a method:

  d = l."__add"(r)

Multi subs are the most flexible way to implement a given
functionality, as a dedicated function is called that already deals
with the correct argument types.

=head2 Method Overloading and GC Issues

Methods and vtables can be overridden with PASM/PIR code. This
currently needs the invocation of an extra run loop. We then have a C
stack layout like e.g. this (assuming the stack grows downwards):

  switch_core()                  # e.g. when running -S
  ParrotObject_get_integer()     # in ParrotObject 2)
  run_meth_fromc_args_reti()     # 2)
  runops_args()                  # src/inter_run.c
  runops()
  runops_int()
  switch_core()                  # next run loop

  2) Parrot_ prefix omitted, but you get the picture.

With this C stack layout, we basically have two problems:

=head3 The C stack isn't traced, if GC is called from the runloop

A C<sweep> opcode issued from the runloop doesn't trace the C stack.
Here is the reason why:

=over

=item Diagram Stack depth vs execution time

  C stack, time A             C stack time B
  ------------------------------------------
  run_core                    run_core
  [ Vtable function ]         [ some function ]
  [ stack alignment  ]        [ another function ]        
  local PMC variable A        [ stack alignment ]
                              local PMC variable B

  ---> time    

=item Stack alignment

The ABIs of C<x86> and C<ppc> both define a stack alignment of 16
bytes. To keep that alignment, the stack pointer is simply decremented
by some amount if needed. This can create 'holes' in the C stack which
aren't filled by any local variable; the memory just keeps its old
contents.

=item Timely destruction and conservative GC

Parrot has a conservative GC, which basically means: every memory
location that looks like a pointer to an object keeps this object
alive. If there is an INTVAL that holds the address of a PMC, that PMC
is marked as alive.

Now when such a PMC needs timely destruction, the GC will find it
alive even if no real reference to that PMC exists anymore, and timely
destruction no longer works.

=item All together

Due to the stack alignment there is a fair (and fully unpredictable)
chance that "old" memory looks like a PMC. In the diagram above,
C<PMC A> is still visible on the C stack at time C<B>. This leads to
subtle bugs which are almost impossible to track down. It took me
three days to even grasp what was going on here, and some more to
fully analyze the problem (yes, this bug did occur, and it's the
reason for not tracing the C stack at the runloop level).

=back

=head3 Not tracing the C stack is also wrong

When there is a secondary run loop, all items further up the stack
aren't traced *if* GC is triggered by the C<sweep> opcode. This could
lead to infant mortality (F<docs/dev/infant.pod>), unless extra care
is taken to anchor objects manually. This is of course error-prone as
hell, because you don't always know that somewhere down the stack
another runloop will be started.

=head2 MMD issues

Currently the MMD lookup is fully dynamic, but cached in a quadratic
table per type pair. Besides the size issues (the table is already
1 MByte just for Parrot core types) there is another problem: the
table holds pointers to both C code and PMCs. The kind of pointer is
encoded by setting the low bit of the pointer. But this hack fails
miserably if the architecture doesn't align C functions at least at
word boundaries or uses the low bits for permissions (e.g. hppa).

=head1 Function Summary

=head2 Table of C Functions and Major Characteristics

          speed   extend    inherit  introspect   call scheme
  -----------------------------------------------------------
  Opcode    +        1)        -         +           Opcode   
  Vtable    +        -       static      -           Vtable 
  METHOD    - 2)     +         +         +           NCI 
  Multi     - 2)     +         +         +           C 3)

  1) Dynamic opcodes, but only usable as function.
  2) Without opcode rewriting/caching
  3) or NCI if a multisub is obtained by find_name

=head2 Pros for current

=over 4

=item * It works (more or less) and is implemented.

=item * Some items are fixable. E.g. we could create an introspection
add-on for vtables.

=back

=head2 Cons

=over 4

=item * Code size

=item * May hit hard compiler limits by adding further opcodes
(permutations)

=item * Code complexity due to 4 differing schemes

=item * Fixing current issues needs still more code and adds
complexity

=item * Big memory footprint with bad cache locality

=item * Implementing interfaces is complex when it comes to mixing
different vtables and methods to create new functionality

=back

=head1 PROPOSAL

=head2 Summary

The 4 differing call schemes are replaced by just one, named B<CSub>.
The calling conventions for a B<CSub> are standardized.

=head2 B<CSub PMC>

Well, there is already an existing (but unused) F<src/pmc/csub.pmc>;
I'm just reusing it. Anyway, a B<CSub PMC> provides at least
these fields:

   function_ptr        ... to the C function
   name                ... function name
   function_signature  ... kind of params/return value 

The C<CSub PMC> replaces C<NCI> for calls inside the interpreter
(to known functions).

=head2 Calling conventions: B<csub_fun>

All functions pointed to by a B<CSub> have the same function
prototype:

  typedef int (*csub_fun) (Interp *, void **args);

The C<args> array is defined to hold these items:

  args[0]          ... address of the return value, if the function
                       returns something
  args[0/1..n-1/n] ... the n arguments the function gets (the indices
                       shift up by one if there is a return value).
                       These are either INTVALs or pointers to other
                       types.
  args[n/n+1]      ... pc of this instruction
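
For example, for a C<sub I4, I4, I5> the POC's C<set_args> (see
F<np2.c> below) populates the array as:

  args[0] = &REG_INT(4)   /* address of the result register */
  args[1] = REG_INT(4)    /* first operand, by value */
  args[2] = REG_INT(5)    /* second operand, by value */
  args[3] = pc            /* pc of this instruction */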

The return value is 1 if the function changed the C<pc>, i.e. it
caused a branch, and 0 otherwise.

The C<pc> is used for functions that are called from the runloop and
is usually just ignored. But it also allows running arbitrary PASM/PIR
code by a) creating a new register context, b) creating a return
continuation, c) updating the C<pc> to point to the user code (a-c is
exactly what C<Sub.invoke> does) and d) setting up the C<get_results>
and C<set_args> pointers. This means that any CSub called from the
runloop can easily call a PASM/PIR implementation without starting an
extra run loop.

=head2 Opcode functions become C<csub_fun>'s

Here is the same C<sub> opcode as a C<csub_fun>, first as C source,
expanded from the POC macros in F<np2.c> below:
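
  static int
  F_sub_i_i_i(Interp *interpreter, void **args)
  {
      /* args[0] holds the address of the result register,
       * args[1] and args[2] hold the operand values */
      *(int*)args[0] = (int)args[1] - (int)args[2];
      return 0;    /* no branch taken */
  }

And here is its disassembly: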

  (gdb) disas F_sub_i_i_i
  Dump of assembler code for function F_sub_i_i_i:
  0x080484d0 <F_sub_i_i_i+0>:     push   %ebp
  0x080484d1 <F_sub_i_i_i+1>:     mov    %esp,%ebp
  0x080484d3 <F_sub_i_i_i+3>:     mov    0xc(%ebp),%eax
  0x080484d6 <F_sub_i_i_i+6>:     mov    0x4(%eax),%edx
  0x080484d9 <F_sub_i_i_i+9>:     mov    (%eax),%ecx
  0x080484db <F_sub_i_i_i+11>:    sub    0x8(%eax),%edx
  0x080484de <F_sub_i_i_i+14>:    xor    %eax,%eax
  0x080484e0 <F_sub_i_i_i+16>:    mov    %edx,(%ecx)
  0x080484e2 <F_sub_i_i_i+18>:    pop    %ebp
  0x080484e3 <F_sub_i_i_i+19>:    ret    

It's obvious that getting the args is now as efficient as getting
arguments from the stack. No extra non-volatile registers are used
(and hence none have to be preserved). The opcode count is halved, and
code size is reduced even more due to the simpler (and faster) opcodes.

Actually, the above assembler code is also (almost [3]) the same as
the code generated by F<src/pic_jit.c>, i.e. a PASM/PIR subroutine
compiled to hardware assembler code.

This one function is now also the same for C<sub_i_i_ic> and
C<sub_i_ic_i> and can therefore be emitted just once. This reduces the
code size of F<core_ops.c> to about a quarter of its current size.

=head3 The runcore dereferences arguments

The reduced code size in the opcode function of course comes at a
price. Dereferencing the arguments per opcode is specialized per
argument type. Doing it just once in the runloop needs more general
code that dereferences the arguments in a loop for all opcodes and
fills the args array.

That is, we have a switch statement over argument types, very similar
to argument passing for a PASM subroutine (see C<set_args> in F<np2.c>
below). This is slower in the general case of course. I've timed
C<examples/benchmarks/mops_intval.pasm>, which measures pure opcode
dispatch speed:

 ./parrot -f    3.25 s
 ./np2 mops    10.4  s 

(np2.c is a proof-of-concept program implementing the major ideas of
opcode dispatch). But there is also:

 ./np2 mops     1.45 s 

Now what: is it 3 times slower or faster? The answer can be: both. The
first and slow variant uses the general code all the time. The second
timing is from partially inlining the most-used opcodes into a switch
statement and not calling any function at all. That is, the function
runcore becomes a partially switched one, but the now smaller and
faster opcode functions are still there when the JIT core needs them.

The C<mops> benchmark is of course just measuring pure opcode dispatch
with very low-overhead function bodies. More complex functionality in
opcodes is far less subject to dispatch speed. And a final remark: the
JIT core is more than 300 times faster than C<parrot -f> for the
I<MOPS> benchmark.

Issues from above:

=head3 a) just calling another function

As the call signature of the opcode and the called function are the
same, we can just call the function directly. "Everything is an
opcode" (Dan S.) gets a revival, yeah. But in a much more general and
useful sense: we get rid of the duplication due to argument
permutation and additionally avoid an extra function call.
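
Sketched with the C<op_info_t> metadata of the POC F<np2.c> below, and
assuming C<Parrot_string_find_cclass> were rewritten to the
C<csub_fun> convention, the opcode table entry would just point at the
library function itself:

  {
      "find_cclass",
      Parrot_string_find_cclass,    /* the API function, no wrapper */
      { ARG_IR, ARG_I, ARG_S, ARG_I, ARG_I },
      6                             /* opcode + 5 args */
  },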

=head3 b) too complex

We can denote opcode functions with an extra attribute stating how
they'll be implemented. There is already C<inline op> (which is
currently just ignored). It would mean that a) all runcores implement
it and b) the function body is also inlined in the switch statement of
the function core.

The opposite of that would be C<function op>, which means that the one
and only incarnation of this opcode is in F<src/ops/core_ops.c> (or
any other file) and other runcores just call this function. Opcodes
with a plain C<op> have one function body in each runcore. "Opcode"
for a C<function op> means: the C<op_info_t> structure provides the
metadata for function arguments and a possible return value, and there
is one and only one C<csub_fun> that implements the function body.

=head3 Prederefed runcores, loadable opcodes

As opcode functions are much more light-weight with this calling
scheme, we can call opcode functions from the function runcore with
less penalty. If we keep interpreter-independent code, we can denote
opcodes with e.g. C<function op> and always call the function
implementation instead of duplicating and permuting code in the
switched, direct-threaded, and JIT cores. The latter does exactly that
anyway if the architecture doesn't provide a JITted version of the
opcode.

=head2 Vtable functions and METHODS become C<CSubs>

Most of the vtable variants and all the NCI methods would be replaced
by B<CSub>s. The vtable would have some basic information about the
PMC like C<name> or C<base_type>. The C<CSubs> are kept in the method
or namespace hash of the PMC (basically an C<OrderedHash>), which
permits indexed access too. E.g.

  PMC *csub =  pmc->vtable->meth_hash_bucket_ptr[idx].value;
  branched = ((csub_fun) PMC_struct_val(csub))(interp, args);

Or:

  branched = pmc->vtable->csubs[idx](interp, args);

for functions kept directly in the vtable structure.

=head2 Summary

=head3 Proposal Pros

=over 4

=item * The unification of the different call schemes of C functions and the
usage of it as an opcode replacement would vastly simplify Parrot's
internals.

=item * The unification of calling schemes generalizes the concept of
an opcode and unifies opcodes with other API functions inside the
interpreter.

=item * Moving the argument setup into the runloop vastly reduces the
permutation count of current opcodes. The B<64> incarnations of
C<find_cclass> boil down to just one, which is actually the
called function itself.

=item * Having just one calling scheme is much more friendly to
further optimizations like creating JITted variants

=item * The mentioned GC problem due to secondary runloops can be
avoided by allowing all (PMC) opcodes to branch to another Sub.

=item * The unification of vtable functions and methods will simplify
the implementation of interfaces.

=item * Less overhead for method calls (NCI is much more general)

=back

=head3 Cons

=over 4

=item * Calling a B<CSub> as opcode needs an extra loop for argument setup

=item * Calling a B<CSub> from Parrot C code is slightly less efficient than
calling a plain C function.

=item * It has to be implemented

=back

=head1 TODO

Most of the todos aren't directly related to this proposal but need
fixing/speccing anyway.

=over 4

=item * Specify method calls to B<CSub>s

=item * Static (class) methods vs functions

=item * Namespace issues e.g. open P0, 'f'  vs  P0 = "open"('f')

=item * Call signatures for multi-subs and -methods

=back

=head1 IMPLEMENTATION STEPS

=head2 Proof of concept

The example program C<np2.c> (see also F<docs/dev/nanoparrot.c>)
implements the opcode dispatch and a limited set of argument setup for
integer and some other signatures. It additionally uses a C<short>
(16 bit) type for opcodes and is prepared for loading long integer
constants from the constant table.

=head2 Adjust code generation chain

=over 4

=item * Annotate opcodes with C<inline> or C<function>

=item * Call C<function op> from other runcores

=back

=head2 Interfaces (needs more specs still)

=over 4

=item * Split the vtable into vtable and CSub parts

=item * Patch PMC and opcode compilers to create CSubs

=back

=head2 Cleanup and removal of old code

=head1 FOOTNOTES

=over 4

=item [1] JIT

The term C<JIT> in Parrot is technically misleading when it comes to
the sense of 'just in time (compilation)'. Parrot predereferences code
just in time, when the instruction is first executed. This is actually
a just-in-time recompilation to a faster opcode, especially when a
polymorphic inline cache (PIC) caches e.g. a hash lookup.

Compilation to machine code is still done per file (with the C<-j>
switch) or now (very limited) per subroutine (with C<-Sj> or C<-Cj>).
Nevertheless I'll use C<JIT> here in the sense of compiling to machine
code.

=item [2] JIT opcode coverage

It takes about 0.2 - 0.5 hours to properly implement and test just one
JITted opcode for one architecture (if all goes well and you don't
have to resort to debugging). See e.g. the svn commit log from Feb
16th, r11554 - r11562, where I implemented about 20 ops for the PPC
architecture. Already existing templates vastly simplified that job.

Please open F<src/jit_cpu.c> in C<$EDITOR> and locate the line:

  Parrot_jit_fn_info_t op_jit

All entries with C<1> in the C<extcall> field are not implemented
inside the JIT runtime and have to call a function inside
F<src/ops/core_ops.c>.

=item [3] PIC/JIT

Below is the disassembly of the C<sub> function from:
  
  .sub main
    ...
    $I0 = 'sub'(i, j) 
    ...
  .end
  .sub 'sub'
    .param int a
    .param int b
    $I0 = a - b
    .return ($I0)
  .end

created by:
  
  $ gdb parrot
  (gdb) b parrot_build_asm
  (gdb) r -Cj sub.pir
  (gdb) fin
  (gdb) x /13i $1->arena.start
  0x84b7d58:      push   %ebp
  0x84b7d59:      mov    %esp,%ebp
  0x84b7d5b:      mov    0x10(%ebp),%eax
  0x84b7d5e:      mov    0x4(%eax),%edx
  0x84b7d61:      mov    0x8(%eax),%ecx
  0x84b7d64:      sub    %ecx,%edx
  0x84b7d66:      mov    0x10(%ebp),%eax
  0x84b7d69:      mov    (%eax),%eax
  0x84b7d6b:      mov    %edx,(%eax)
  0x84b7d6d:      xor    %eax,%eax
  0x84b7d6f:      mov    %ebp,%esp
  0x84b7d71:      pop    %ebp
  0x84b7d72:      ret

The second fetch of C<args> isn't needed, but besides that it's the
same code as I<gcc> creates for a C<csub_fun>.

As PIC/JIT uses the same calling conventions too, this creates a wide
area of future improvement possibilities. A loadable opcode library
could e.g. have the actual opcode bodies implemented as PIR
subroutines that get compiled to native assembler code. We'd just need
more types (char, float, vector, ...) in PASM to be able to express
the necessary transitions to hardware assembler code.

=back

=head1 SEE ALSO

Please make sure to have a look at the generated C<core*.c> files,
F<src/ops/core_ops.c>, F<src/ops/core_ops_switch.c> et al. This needs
a built parrot of course. F<src/jit_cpu.c> and F<src/exec_cpu.c> are
two more generated files, for JIT and EXEC.

The demo code is in F<np2.c>.

=cut

vim: expandtab sw=2 tw=70:
/*
 * - demonstrates how the interpreter interprets csub_fun opcodes
 *
 * - compile with:
 *
 * cc -o np2 -Wall np2.c -g -O2 && time ./np2 mops
 */

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

typedef int INTVAL;
/*
 * while not strictly necessary, the program also shows how bytecode
 * size could be halved (32-bit arch) by using a short opcode_t
 */
typedef short opcode_t;
typedef double FLOATVAL;
typedef void PMC;
typedef void STRING;

struct Interp_t;

typedef int (*csub_fun)(struct Interp_t*, void**);

/*
 * simplified register structure - it provides the same number
 * of indirections to access a register, though
 */

#define NUM_REGISTERS 32

struct Reg {
    INTVAL int_reg;
    FLOATVAL num_reg;
    STRING *string_reg;
    PMC *pmc_reg;
};

#define REG_INT(x) interpreter->bp[x].int_reg

/* the constant table is simplified */
#define PCONST(x) interpreter->code->const_table[x]

struct pf {
    opcode_t *byte_code;
    void **const_table;
};

/*
 * op_info_t args, which is arg_type_t but includes also
 * a branch bit and the in/out bit
 */
typedef enum {
    ARG_I,                    /* 0b000000 - plain args */
    ARG_S,                    /* 0b000001 */ 
    ARG_P,                    /* 0b000010 */ 
    ARG_N,                    /* 0b000011 */ 
    ARG_IC,                   /* 0b000100 - in constants */
    ARG_SC,                   /* 0b000101 */ 
    ARG_PC,                   /* 0b000110 */ 
    ARG_NC,                   /* 0b000111 */ 
    ARG_IR,                   /* 0b001000 - out results */
    ARG_SR,                   /* 0b001001 */ 
    ARG_PR,                   /* 0b001010 */ 
    ARG_NR,                   /* 0b001011 */ 
    ARG_BRANCH,               /* 0b001100 - opcode may branch */
    ARG_ICC                   /* 0b001101 - long int constants in table */
} arg_types;

struct op_info_t {
    const char *name;
    /* the distinct op_func_table of parrot - just having one
     * structure seems to be simpler
     */
    csub_fun op_func;
    const arg_types argt[8];
    char n_ops;
};

struct Interp_t {
    struct Reg *bp;
    struct pf *code;
    struct op_info_t *op_info;
    int flags;
};

typedef struct Interp_t Interp;

/*
 * list of all opcodes
 */

#define OPCODES OP(end),      OP(print_sc), OP(print_i),  \
                OP(set_i_ic), OP(set_i_icc), OP(if_i_ic),  OP(sub_i_i_i), \
                OP(MAX)


#define OP(x) OP_ ## x
typedef enum { OPCODES } opcodes;
#undef OP

#define ARG_MAX   6

/*
 * argument setup - fill the **args array
 */
static int
set_args(Interp *interpreter, struct op_info_t *info, opcode_t *pc, void **args)
{
    int i, n, idx;

    n = info->n_ops - 1;
    args[n] = pc++;   /* pc of this instruction goes last; step past the op */
    for (i = 0; i < n; ++i) {
        idx  = pc[i];
        switch (info->argt[i]) {
            case ARG_IR:  /* result - address of register */
                args[i] = &REG_INT(idx);
                break;
            case ARG_I:   /* value - contents of register */
                args[i] = (void*)REG_INT(idx);
                break;
            case ARG_IC:  /* (small) constant - value of opcode */
                args[i] = (void*)(idx);
                break;
            /*
             * a note WRT long integer constants:
             * currently we need sizeof(opcode_t) >= sizeof(INTVAL)
             * to be able to store INTVAL constants in the bytecode
             *
             * This will become a major PITA for 64 bit archs, as:
             * - the hardware does not provide 64-bit immediate
             *   opcodes (not even amd64 - there is just one mov ins)
             * - it's causing huge bytecode size
             *
             * Therefore we might move integer constants or "big"
             * integer constants to the constant table too.
             * The PASM compiler can emit a special load instruction
             * (here set_i_icc) to fetch the constant.    
             */
            case ARG_ICC:
                args[i] = PCONST(idx);
                break;
            case ARG_SC:
                args[i] = PCONST(idx);
                break;
            case ARG_BRANCH:
                args[i] = (void*)(idx);
                break;
            default:
                /* original assert("illegal arg" != NULL) was always
                 * true and could never fire */
                assert(!"illegal arg");
        }
    }
    return n + 1;
}

/*
 * function core runloop with partially inline switched opcodes
 */
static void 
run(Interp *interpreter, opcode_t *pc) 
{ 
    int n;
    void *args[ARG_MAX];
    struct op_info_t *op_info, *info;
    opcode_t op;

    op_info = interpreter->op_info;
    while (pc) {
        op = *pc;
#ifdef TRACE
        printf("PC %2d OP %d %s\n", pc - interpreter->code->byte_code, 
                op, op_info[op].name); 
#endif
        info = op_info + op;
        /* inline a few ops; in practice I'd expect 50-100 inline ops */
        switch (op) {
            case OP_sub_i_i_i:  /* inline op */
                REG_INT(pc[1]) = REG_INT(pc[2]) - REG_INT(pc[3]);
                pc += 4;
                break;
            case OP_if_i_ic:    /* inline op */
                if (REG_INT(pc[1])) {
                    pc += (int)pc[2];
                    break;
                }
                pc += 3;
                break;
            default:            /* func-only op */ 
                n = set_args(interpreter, info, pc, args);
                pc += n;
                if (info->op_func(interpreter, args)) {
                    /* if it branched, goto new pc */
                    pc = args[n - 1];
                }
        }
    }
}

/*
 * nanoparrot also implements switched and cgoto runcores with the
 * help of these macros. In reality, the opcode compiler provides the
 * necessary code mangling.
 */

#  define CASE(function) \
static int \
F_ ## function (Interp *interpreter, void **args) {

#  define DISPATCH
#  define ENDRUN 
#  define ENDDISPATCH
#  define NEXT return 0; }
#  define BRANCHED return 1; }

/*
 * dispatch loop / opcode (function) bodies i.e. the .ops files
 *
 * core.ops et al
 */

    DISPATCH
        CASE(end)
            args[0] = NULL;
            BRANCHED
        CASE(print_s)
            printf("%s", (char*)args[0]);
            NEXT
        CASE(print_i)
            printf("%d", (int)args[0]);
            NEXT
        CASE(set_i_i)
            /* there is just one set_i_ix function */
            *(int*)args[0] = (int)args[1];
            NEXT
        CASE(if_i_i)
            if ((int)args[0]) {
                opcode_t *pc = args[2];
                pc += (int)args[1];
                args[2] = pc;
                return 1;
            }
            NEXT
        CASE(sub_i_i_i)
            *(int*)args[0] = (int)args[1] - (int)args[2];
            NEXT
        CASE(MAX)
            printf("illegal opcode\n");
            exit(1);
            NEXT
    ENDDISPATCH
ENDRUN

/*
 * opcode metadata, also generated by the opcode compiler
 */
static struct op_info_t op_info_init[] = {
    {
        "end",
        F_end,
        {0},
        1
    },
    {
        "print_sc",
        F_print_s,
        { ARG_SC },
        2
    },
    {
        "print_i",
        F_print_i,
        { ARG_I },
        2
    },
    {
        "set_i_ic",
        F_set_i_i,               /* just one function set_i_i */
        { ARG_IR, ARG_IC }, 
        3
    },
    {
        "set_i_icc",
        F_set_i_i,               /* just one function set_i_i */
        { ARG_IR, ARG_ICC },
        3
    },
    {
        "if_i_ic",
        F_if_i_i,
        { ARG_I, ARG_BRANCH },
        3
    },
    {
        "sub_i_i_i",
        F_sub_i_i_i,
        { ARG_IR, ARG_I, ARG_I },
        4
    }
};
    
static void
init(Interp *interpreter, opcode_t *prog)
{
    /*
     * create 1 register frame
     */
    interpreter->bp = calloc(NUM_REGISTERS, sizeof(struct Reg));
    /*
     * set opcode metadata
     */
    interpreter->op_info = op_info_init;
    /*
     * the "packfile"
     */
    interpreter->code = malloc(sizeof(struct pf));
    interpreter->code->byte_code = prog;

    /*
     * create a simplified constant table
     */
#define N_CONSTS 5
    interpreter->code->const_table = malloc(N_CONSTS * sizeof(char*));
    interpreter->code->const_table[0] = "\n";
    interpreter->code->const_table[1] = "done\n";
    interpreter->code->const_table[2] = "error\n";
    interpreter->code->const_table[3] = "usage: ./nanoparrot2 mops\n";
    interpreter->code->const_table[4] = (void*) 100000000;
}

int
main(int argc, char *argv[]) {
    opcode_t *prog;

    /*
     * the mops main loop
     */
    opcode_t mops[] =
    	{ OP_set_i_icc, 4, 4, 	/* set I4, n */
	  OP_print_i, 4, 	/* print I4 */
	  OP_print_sc, 0, 	/* print "\n" */
          OP_set_i_ic, 5, 1, 	/* set I5, 1 */
	  OP_sub_i_i_i, 4, 4, 5,	/* L1: sub I4, I4, I5 */
	  OP_if_i_ic, 4, -4,	/* if I4, L1 */
	  OP_print_sc, 1, 	/* print "done\n" */
	  OP_end 		/* end */
	};
    /* another program */
    opcode_t usage[] =
    	{
          OP_set_i_ic, 0, 2, 	/* set I0, 2 */
	  OP_if_i_ic, 0, 6,	/* if I0, L1      # test branching */
	  OP_print_sc, 2,	/* print "error\n" */
	  OP_end, 		/* end */
	  OP_print_sc, 3,	/* L1: print "usage...\n" */
	  OP_end 		/* end */
	};
    Interp *interpreter = malloc(sizeof(Interp));

    prog = usage;
    if (argc > 1) {
        if (!strcmp(argv[1], "mops"))
            prog = mops;
    }
    init(interpreter, prog);
    run(interpreter, prog);
    return 0;
}

/*
 * Local variables:
 * c-indentation-style: bsd
 * c-basic-offset: 4
 * indent-tabs-mode: nil
 * End:
 *
 * vim: expandtab shiftwidth=4:
*/
