This is going to be extremely light on details with respect to the current state of
the Parrot interpreter.
It is also going to be expressed in terms of Win32 APIs.
For both of these I apologise in advance. Time, and the "or forever hold your peace"
imperative, have overridden my desire to do otherwise.
My attempts to get up to speed enough on the sources to work out how to apply the ideas I
am going to express, and to become conversant with the *nix pthreads implementation and
terminology, have moved too slowly to afford me the opportunity to do more.
Hoping this stimulates ideas rather than a flame war.
Regards, Nigel Sandever.
------------------------
THE BASIS OF THE IDEA
Modern OSs succeed in having multiple threads of execution share a single copy of
process memory without the operations of one
thread being able to interfere with the state of another. The state of the code
running in those threads may be corrupted through
mismanagement. But the state of the threads themselves, their program counters,
registers and stacks cannot. The mechanisms for
this incorruptibility are:
1. Each operation (opcode) performed by the thread is atomic. The scheduler can never
interrupt a thread whilst an operation is in progress, only between operations.
2. Before the start of an operation, and after the end of one, the state of the thread is
entirely encapsulated within the registers and stack. By swapping the entire state of the
CPU register set when switching from one thread to another, the state of each thread is
preserved and reconstituted. No other mechanisms or interlocks are required.
By analogy, a virtual machine that wishes to have multiple threads of execution must
achieve the same level of atomicity for each
operation it performs.
VIRTUAL MACHINE INTERPRETER
At any given point in the running of the interpreter, the VM register set, program
counter and stack must represent the entire
state for that thread. Once an opcode has started execution on a given thread, no other
thread of execution within that interpreter may be allowed to start an operation until the
first thread has completed its opcode.
NON-VMI THREADS
ASYNCHRONOUS IO
Note that this does not mean that no other thread in the process can take a timeslice,
only that any thread that is allowed to run
should not be able to affect the state of the VM in question. A thread charged with
performing asynchronous reads on behalf of the
user program running within the VM interpreter can go ahead so long as it doesn't
directly modify the VMI state.
EVENT MANAGER THREAD
Equally, an event thread can also be run concurrently in the background to receive
asynchronous notifications (signals, messages,
asynchronous read completions etc.). It can then queue these events and set a flag that the
VMI can inspect between each iteration of the "opcode execution loop" and act upon
appropriately. This gives the benefit of "safe signals", along with safe and timely
processing of other, similarly asynchronous events.
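To make this concrete, here is a minimal sketch, in C with Win32 calls, of what the
event-manager side might look like. The names (post_event, vmi_event_pending, the queue
structure and drain_event_queue) are invented for illustration; only the Win32 calls
themselves are real.

    #include <windows.h>

    /* Flag polled by the opcode loop between opcodes (illustrative).   */
    static volatile LONG vmi_event_pending = 0;

    /* Simple queue of pending events, guarded by its own critsec.      */
    struct vmi_event { int type; void *data; struct vmi_event *next; };
    static struct vmi_event *event_q_head = NULL;
    static CRITICAL_SECTION  event_q_cs;   /* initialised at startup    */

    /* Called from the event-manager thread when a signal, message or
     * async-IO completion arrives: queue it, then raise the flag.      */
    void post_event(struct vmi_event *ev)
    {
        EnterCriticalSection(&event_q_cs);
        ev->next     = event_q_head;
        event_q_head = ev;
        LeaveCriticalSection(&event_q_cs);
        InterlockedExchange(&vmi_event_pending, 1);
    }

    /* Between opcodes the VMI would do something like:
     *     if (InterlockedExchange(&vmi_event_pending, 0))
     *         drain_event_queue();   // dispatched as a "sysopcode"
     */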
GARBAGE COLLECTION
The garbage collector would need to run *synchronously* with the interpreter opcode
loop. Effectively, it would be another (possibly
long running) opcode. An analogy for this is the set of long-running syscalls that operate
within a multi-tasking OS, e.g. synchronous IO, virtual memory allocation, swapping etc.
Just as virtual memory operations suspend the affected processes until the operations are
complete, so garbage collection can be seen as a virtual memory operation for the virtual
machine, one that requires the VMI to be suspended until the operation is complete.
PREREQUISITES
The key requirements are that the VM be reentrant, and the use of (in Win32 terms) a
"critical section".
REENTRANCY
Not only must the VMI be coded in a reentrant fashion, with all state addressed through
pointers (references) loaded into its virtual register set; all the code underlying it,
including syscalls and the CRT, must equally be reentrant. Many APIs within many CRTs
*are not reentrant* (e.g. strtok()). All state must be on a per-thread, not a per-process,
basis.
To this end, I would propose that no CRT APIs be used directly from the main code!
Instead, a full set of CRT-like macros would be inherited from a header file, where either
the real CRT API would be called, or an alternative implementation. This header file would
be conditionally included on the basis of the target platform. This concentrates the bulk,
if not all, of the platform-specific code into a single file (or set of files).
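As a sketch of what such a header might contain (the macro name and file layout are
invented for illustration; strtok_r is the POSIX reentrant form, and strtok_s is used here
on the assumption that the Win32 CRT provides it, otherwise a private implementation would
be substituted):

    /* vm_crt.h -- hypothetical platform shim, included on the basis of
     * the target platform.  Non-reentrant CRT calls are mapped to
     * reentrant equivalents (or to private implementations where the
     * target CRT has none).                                            */
    #include <string.h>

    #if defined(_WIN32)
    #  define VM_STRTOK(str, delim, ctx)  strtok_s((str), (delim), (ctx))
    #else
    #  define VM_STRTOK(str, delim, ctx)  strtok_r((str), (delim), (ctx))
    #endif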
EXPANDING THE VIRTUAL MACHINE
What I am suggesting here is that the virtual machine analogy be extended to encompass not
just the VHLL-user-program view of the registers and stack, but also its view of the entire
machine and underlying OS.
By so doing, not only does it allow those working on the core of the VMI to be isolated
from the underlying machine and OS architecture; it allows them to extend the VMI's view of
the world beyond the concepts of any single architecture. It also serves the very practical
purpose of avoiding a morass of platform-conditional compilation directives in the bulk of
the code, by concentrating them in single (groups of) files, by platform.
ATOMICITY AND CRITICAL SECTIONS
Atomicity of the VMI opcodes can be achieved by having the core loop of the
interpreter enter and exit a critical section each time
it processes an opcode. For those not familiar with critical sections, they are a (Win32)
OS construct that prevents any cooperating thread within a process from receiving a
timeslice until the thread that entered the critical section has exited it.
Unlike semaphores and events, critsecs only operate within a single process. They are
relatively lightweight as there is no need to
enter kernel mode for their operation. They are, in effect, a CPU-specific "test and set"
operation applied to a piece of user-mode memory. Their lightweight nature means that they
are faster than inter-process semaphore mechanisms. When used in a process that currently
has only a single thread in operation, they have an almost negligible performance effect
upon that thread. They also operate
correctly on SMP machines.
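A minimal sketch of the core loop under this scheme, assuming a hypothetical per-thread
interpreter structure and dispatcher (only the critical-section calls are real Win32 APIs):

    #include <windows.h>

    /* Illustrative per-thread interpreter state: registers, stacks,
     * and a program counter into the bytecode stream.                  */
    struct vmi_thread { const unsigned char *pc; /* ... */ };

    /* Hypothetical dispatcher: executes one opcode, returns next pc.   */
    const unsigned char *dispatch_op(struct vmi_thread *t);

    /* One per interpreter; InitializeCriticalSection(&vmi_cs) is
     * called once at interpreter startup.                               */
    static CRITICAL_SECTION vmi_cs;

    void vmi_run(struct vmi_thread *t)
    {
        while (t->pc != NULL) {
            EnterCriticalSection(&vmi_cs);   /* opcode becomes atomic   */
            t->pc = dispatch_op(t);
            LeaveCriticalSection(&vmi_cs);   /* other VM threads may
                                                now run their opcodes   */
        }
    }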
PROCESS GLOBAL STATE
Unavoidably process-global state, such as filehandles etc., should only be usable (from the
user program being run by the VMI) through the use of opcodes. Done this way, the
serialisation of the opcodes at the VMI level protects the global state.
THIS IS ONLY PART OF THE SOLUTION
None of the above addresses the problems of synchronising user program state. This will
still need to be addressed. It could either be done by the user code, through the use of
something like the P5 :shared and lock() mechanisms, or, preferably, transparently by the
VMI itself; possibly by having each PMC contain an 'in-use' flag that would be set each
time an opcode (method) was dispatched on a PMC (object) and cleared when the operation
completes.
If two threads attempt concurrent operations on the same PMC, the first thread accessing
the PMC sets the flag. When a second thread attempts to access it, it finds the flag set
and blocks (relinquishing its timeslice) until the first thread has completed and cleared
the flag. This doesn't solve the potential deadlock problems arising from the user-level
code, though it may be possible for the VMI to detect and diagnose these at runtime in a
similar way to how deep-recursion detection is done in P5.
The cost to the single-threaded user application is a test&set operation (a few cycles) per
object-method, plus the flag, which need only be one bit but may be more efficiently coded
as a 32-bit entity.
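A minimal sketch of that test&set, assuming a hypothetical in_use field in each PMC header
(the PMC layout and function names are invented; the interlocked calls are real Win32
APIs):

    #include <windows.h>

    struct pmc { volatile LONG in_use; /* ... vtable, data ... */ };

    void pmc_acquire(struct pmc *p)
    {
        /* Set in_use to 1 iff it was 0; otherwise give up the slice
         * and retry when rescheduled.                                   */
        while (InterlockedCompareExchange(&p->in_use, 1, 0) != 0)
            Sleep(0);                        /* relinquish timeslice     */
    }

    void pmc_release(struct pmc *p)
    {
        InterlockedExchange(&p->in_use, 0);  /* clear the flag           */
    }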
CONCLUSION (TENTATIVE, NO PROOF)
As all VHLL entities are PMCs at the VMI level, the sharing of data between threads at
the VHLL level is done entirely through
those PMCs. If no single PMC can be the subject of an opcode on two threads
concurrently, there /should/ be no opportunity for
conflict.
As all VMI internal state is encapsulated within the VMI register set, and each thread has
its own set of registers, the internal state of the VMI(s) should be equally secure.
Other internal housekeeping operations, memory allocation, garbage collection etc., are
performed as "sysopcodes", executed by the VMI under the auspices of the critical section,
and thus secured.
All asynchronous operations are performed by one or more non-VMI threads and do not
adjust the state of the VMI directly.
Instead, they queue notifications of events, and results to the VMI, and these are
detected by the VMI within the body of the main
loop. Once detected, an appropriate sysopcode is dispatched within the critical
section in the normal way.
SOME ADDITIONAL IDEAS
MEMORY MANAGEMENT
I've been looking at (glancing through) the memory management used by Parrot. The basic
concept of memory pools, if I have understood them correctly, is to allocate similarly
sized objects from the same pool. The idea behind this is that it will be easier to find a
suitably sized piece of memory, made available for reallocation after being GC'd, from a
pool dedicated to allocations of a similar size to that needed. This should reduce the
fragmentation of originally large allocations into lots of small allocations, thereby
reducing the need for coalescence and the overall memory allocation.
The thought struck me that with the exception of strings and (other) compound data
types, all the instances of any given class are
always the same size. Where an existing class is dynamically extended in some way, the
result is the creation of a new (anonymous)
class as pre-existing instances of the original class would not be modified. As such,
the class would seem to be the ideal point at
which to pool the allocation and deallocation of memory.
If each class is given its own memory pool whose size is a multiple of its instance size,
that pool effectively becomes a (C-style) array of instances. As each element is exactly
the right size, the only time the pool would need to grow is when all the existing elements
are in use and another is required. Once an instance has been freed, its slot in the array
would be available, and exactly the right size, for the next instantiation. There would
never be a requirement to coalesce allocations. Whether the free slots are found through a
free space chain, or even a linear search, the GC would only need to process this pool when
a new instance of this class is being allocated. It would then only need to scan a single
pool of (same-sized) reference objects to GC the existing instances.
If all references to this type were allocated from another pool, dedicated to
references-to-this-class, that also hung off the parent class, then the volume of
references to be scanned would also be minimised.
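A rough sketch of such a per-class pool, with invented names and layout, purely to
illustrate the free-list idea:

    #include <stddef.h>

    struct class_pool {
        unsigned char *base;       /* start of the slot array            */
        size_t         slot_size;  /* == the instance size of the class  */
        size_t         n_slots;    /* slots currently committed          */
        void          *free_list;  /* chain threaded through free slots  */
    };

    /* Hypothetical: commits another slot (or page of slots).            */
    void *pool_grow(struct class_pool *p);

    void *pool_alloc(struct class_pool *p)
    {
        if (p->free_list) {                    /* reuse a freed slot     */
            void *slot   = p->free_list;
            p->free_list = *(void **)slot;     /* pop the chain          */
            return slot;
        }
        return pool_grow(p);
    }

    void pool_free(struct class_pool *p, void *slot)
    {
        *(void **)slot = p->free_list;         /* push onto the chain    */
        p->free_list   = slot;
    }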
If the larger segments of memory allocated for each pool were allocated as virtual memory
segments, rather than as arbitrary chunks of heap memory, the size of each individual pool
can be grown in place, at runtime, without the need to copy the existing space to a new,
larger allocation, and without the need to shuffle any other memory pools around to
accommodate the increase in size. The OS's Virtual Memory Management takes care of the
whole process. It also
becomes possible to reserve a sizeable pre-
allocation for important pools and those known to be likely to grow quickly, without
actually consuming the entire reservation from
the physical memory pool before it is actually needed. The OS will then take care of
notification of an individual pool's need to grow
through page faults and the growth becomes a process of simply committing another
(few) pages from the pre-allocation.
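A sketch of that reserve-then-commit pattern, using the real Win32 virtual memory API but
with invented sizes and names (error handling omitted):

    #include <windows.h>

    #define POOL_RESERVE  (16 * 1024 * 1024)   /* illustrative sizes     */
    #define POOL_COMMIT   (64 * 1024)

    static unsigned char *pool_base, *pool_committed_end;

    void pool_reserve(void)
    {
        /* Reserve address space only; no physical pages are consumed.   */
        pool_base = VirtualAlloc(NULL, POOL_RESERVE,
                                 MEM_RESERVE, PAGE_NOACCESS);
        pool_committed_end = pool_base;
    }

    void pool_commit_more(void)
    {
        /* Commit the next chunk in place; the pool grows without any
         * copying or shuffling of neighbouring pools.                   */
        VirtualAlloc(pool_committed_end, POOL_COMMIT,
                     MEM_COMMIT, PAGE_READWRITE);
        pool_committed_end += POOL_COMMIT;
    }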
The memory requirements of variable size objects can be managed through Global Heap
and sub-allocation APIs as now.
Thread specific local storage required for the VMI's register stacks etc. can be
handled using Thread Local Storage APIs.
COOPERATIVE MULTIPROCESSING AND CO-ROUTINES.
Some applications benefit from the programmer being able to have much tighter control
over the scheduling of threads of execution
within the process. The extreme example of this is co-routines, where control is passed
back and forth between two (or more) pieces of code as the demands of the algorithm and
data dictate, rather than on the basis of arbitrary time-slicing. Win32 has a concept of
fibres. They are manually dispatched threads. A given thread (fibre) receives its timeslice
in the normal way. However, it can transfer the rest of its current timeslice to another
fibre on demand. The first fibre remains suspended until the second fibre transfers control
back. The state of the first fibre is retained just as it would be if the thread had been
swapped by the scheduler.
In this way two (or more) fibres can cooperatively share their timeslices whilst other
threads (and processes) continue to receive
their timeslices in the normal way. This, I think, is the essence of co-routines?
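For reference, the fibre mechanism looks roughly like this (the API calls are real Win32;
the surrounding names are illustrative):

    #include <windows.h>

    static void *main_fibre, *worker_fibre;

    static void CALLBACK worker(void *arg)
    {
        (void)arg;
        for (;;) {
            /* ... produce the next value for the consumer ...           */
            SwitchToFiber(main_fibre);     /* hand control back           */
        }
    }

    void run_coroutines(void)
    {
        main_fibre   = ConvertThreadToFiber(NULL);    /* current thread   */
        worker_fibre = CreateFiber(0, worker, NULL);  /* second routine   */

        SwitchToFiber(worker_fibre);   /* transfer rest of the timeslice  */
        /* control returns here when the worker switches back             */
    }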
I have no understanding (yet) of how the co-routines that Parrot is promising are
implemented on *nix systems. But unless the concept of co-routines is encapsulated at a
higher level, with the implementation pushed down to platform-specific code, or unless *nix
also supports a concept very similar to Win32 fibres, it will be impossible to utilise this
"tailor made" OS facility. Thus, an implementation of co-routines designed to operate well
on *nix platforms is likely to need to be emulated on Win32, with high-level requirements
that won't fit well with the underlying OS. This could result in a similarly sub-optimal
emulation of a *nix concept on Win32 as currently exists for the fork emulation. Other
non-*nix platforms would likely suffer from this forced fit as well.