This is going to be extremely light on details with respect to the current state of 
the Parrot interpreter. 

It is also going to be expressed in terms of Win32 APIs.

For both of these I apologise in advance. Time, and the "or forever hold your peace"
imperative, have overridden my desire to do otherwise.

My attempts to get up to speed enough on the sources to work out how to apply the ideas
I am going to express, and to become conversant with the *nix pthreads implementation
and terminology, have moved too slowly to afford me the opportunity to do more.

Hoping this stimulates ideas rather than a flame war.

Regards, Nigel Sandever.

------------------------

THE BASIS OF THE IDEA

Modern OSs succeed in having multiple threads of execution share a single copy of 
process memory without the operations of one 
thread being able to interfere with the state of another. The state of the code 
running in those threads may be corrupted through 
mismanagement. But the state of the threads themselves, their program counters, 
registers and stacks cannot. The mechanisms for 
this incorruptibility are: 

        Each operation (opcode) performed by the thread is atomic. The scheduler can
never interrupt a thread whilst an operation is in progress, only between operations.
        
        Before the start of an operation, and after the end of one, the state of the 
thread is entirely encapsulated within the 
registers and stack. By swapping the entire state of the CPU register set, when 
switching from one thread to another, the state of 
each thread is preserved and reconstituted. No other mechanisms or interlocks are 
required.

By analogy, a virtual machine that wishes to have multiple threads of execution must 
achieve the same level of atomicity for each 
operation it performs. 

VIRTUAL MACHINE INTERPRETER

At any given point in the running of the interpreter, the VM register set, program 
counter and stack must represent the entire 
state for that thread. Once an opcode has started execution on a given thread, no
other thread of execution within that interpreter may start an operation until the
first thread completes its opcode.

NON-VMI THREADS 

ASYNCHRONOUS IO

Note that this does not mean that no other thread in the process can take a timeslice, 
only that any thread that is allowed to run 
should not be able to affect the state of the VM in question. A thread charged with 
performing asynchronous reads on behalf of the 
user program running within the VM interpreter can go ahead so long as it doesn't 
directly modify the VMI state. 

EVENT MANAGER THREAD

Equally, an event thread can also be run concurrently in the background to receive 
asynchronous notifications (signals, messages, 
asynchronous read completions etc.). It can then queue these events and set a flag
that the VMI inspects between iterations of the "opcode execution loop", acting on
them as appropriate. This gives the benefit of
"safe signals" along with safe and timely 
processing of other, similarly asynchronous events.

GARBAGE COLLECTION

The garbage collector would need to run *synchronously* with the interpreter opcode
loop. Effectively, it would be another (possibly long-running) opcode. An analogy for
this is the long-running syscalls that operate within a multi-tasking OS, e.g.
synchronous IO, virtual memory allocation, swapping etc. Just as virtual memory
operations suspend the affected processes until the operations are complete, so
garbage collection can be seen as a virtual memory operation for the virtual machine,
one that requires the VMI to be suspended until the operation is complete.

PREREQUISITES

The key requirements are that the VM be reentrant, and the use of (in Win32 terms) a
"critical section".

REENTRANCY

Not only must the VMI be coded in a reentrant fashion, with all state addressed
through pointers (references) loaded into its virtual register set; all the code
underlying it, including syscalls and the CRT, must be equally reentrant. Many APIs
within many CRTs *are not reentrant* (eg. strtok()). All state must be held on a
per-thread, not a per-process, basis.

To this end, I would propose that no CRT APIs be used directly from the main code!

Instead, a full set of CRT-like macros would be inherited from a header file, where
either the real CRT API would be called, or an alternative implementation. This
header file would be conditionally included on the basis of the target platform.
This concentrates the bulk, if not all, of the platform-specific code into a single
file (or set of files).

EXPANDING THE VIRTUAL MACHINE

What I am suggesting here is that the virtual machine analogy be extended to
encompass not just the VHLL-user-program view of the registers and stack, but also
its view of the entire machine and underlying OS.

By so doing, not only does it allow those working on the core of the VMI to be
isolated from the underlying machine and OS architecture; it allows them to extend
the VMI's view of the world beyond the concepts of any single architecture. It also
serves the very practical purpose of avoiding a morass of platform-conditional
compilation directives in the bulk of the code, by concentrating them in single
(groups of) files, by platform.


ATOMICITY AND CRITICAL SECTIONS

Atomicity of the VMI opcodes can be achieved by having the core loop of the
interpreter enter and exit a critical section each time it processes an opcode. For
those not familiar with critical sections, they are a (Win32) OS construct that
blocks any cooperating thread in the process that tries to enter the critical
section until the thread that first entered it has exited.

Unlike semaphores and events, critsecs only operate within a single process. They
are relatively lightweight, as there is no need to enter kernel mode for their
operation. They are, in effect, a CPU-specific "test and set" operation applied to
a piece of user-mode memory. Their lightweight nature means that they are faster
than inter-process semaphore mechanisms. When used in a process that currently has
only a single thread in operation, they have almost negligible performance effect
upon that thread. They also operate correctly on SMP machines.

PROCESS GLOBAL STATE

Unavoidably process-global state, such as filehandles etc., should only be usable
(from the user program being run by the VMI) through opcodes. Done this way, the
serialisation of the opcodes at the VMI level protects the global state.

THIS IS ONLY PART OF THE SOLUTION

None of the above addresses the problems of synchronising user program state. This
will still need to be addressed. It could be done either by the user code, through
something like the P5 :shared and lock() mechanisms, or, preferably, transparently
by the VMI itself: possibly by having each PMC contain an 'in-use' flag that is set
each time an opcode (method) is dispatched on a PMC (object) and cleared when the
operation completes.

If two threads attempt concurrent operations on the same PMC, the first thread
accessing the PMC sets the flag. When a second thread attempts to access it, it
finds the flag set and blocks (relinquishing its timeslice) until the first thread
has completed and cleared the flag. This doesn't solve the potential deadlock
problems arising from the user-level code, though it may be possible for the VMI to
detect and diagnose these at runtime, in a similar way to how deep-recursion
detection is done in P5.

The cost to the single-threaded user application is a test&set operation (a few
cycles) per object-method, plus the flag, which need only be one bit but may be more
efficiently coded as a 32-bit entity.

CONCLUSION (TENTATIVE, NO PROOF)

As all VHLL entities are PMCs at the VMI level, the sharing of data between threads at 
the VHLL level is done entirely through 
those PMCs. If no single PMC can be the subject of an opcode on two threads 
concurrently, there /should/ be no opportunity for 
conflict. 

As all VMI internal state is encapsulated within the VMI register set, and each
thread has its own set of registers, the internal state of the VMI(s) should be
equally secure.

Other internal housekeeping operations, memory allocation, garbage collection etc.
are performed as "sysopcodes", executed by the VMI within the auspices of the
critical section, and thus secured.

All asynchronous operations are performed by one or more non-VMI threads and do not 
adjust the state of the VMI directly. 
Instead, they queue notifications of events and results to the VMI, and these are
detected by the VMI within the body of the main 
loop. Once detected, an appropriate sysopcode is dispatched within the critical 
section in the normal way.

SOME ADDITIONAL IDEAS

MEMORY MANAGEMENT

I've been looking at (glancing through) the memory management used by Parrot. The
basic concept of memory pools, if I have understood them correctly, is to allocate
similarly sized objects from the same pool. The idea behind this is that it will be
easier to find a suitably sized piece of memory available for reallocation, having
been GC'd, from a pool dedicated to allocations of a similar size to that needed.
This should reduce the fragmentation of originally large allocations into lots of
small allocations, thereby reducing the need for coalescence, and overall memory
allocation.

The thought struck me that with the exception of strings and (other) compound data 
types, all the instances of any given class are 
always the same size. Where an existing class is dynamically extended in some way, the 
result is the creation of a new (anonymous) 
class as pre-existing instances of the original class would not be modified. As such, 
the class would seem to be the ideal point at 
which to pool the allocation and deallocation of memory. 

If each class is given its own memory pool that is a multiple of its instance size,
that pool effectively becomes a (C-style) array of instances. As each element is
exactly the right size, the only time the pool would need to grow is when all the
existing elements are in use and another is required. Once an instance has been
freed, its slot in the array would be available, and exactly the right size, for the
next instantiation. There would never be a requirement to coalesce allocations.
Whether the free slots are found through a free-space chain, or even a linear
search, the GC would only need to process this pool when a new instance of this
class is being allocated. It would then only need to scan a single pool of
(same-sized) reference objects to GC the existing instances.

If all references to this type were allocated from another pool, dedicated to
references-to-this-class, that also hung off the parent class, then the volume of
references to be scanned would also be minimised.

If the larger segments of memory allocated for each pool, were allocated as virtual 
memory segments, rather than as arbitrary 
chunks of heap memory, the size of each individual pool can be grown in-place, at 
runtime, without the need to copy the existing
space to the new larger allocation. And without the need to shuffle any other memory 
pools around to accommodate the increase in 
size. The OS's Virtual Memory Management takes care of the whole process. It also 
becomes possible to reserve a sizeable pre-
allocation for important pools and those known to be likely to grow quickly, without 
actually consuming the entire reservation from 
the physical memory pool before it is actually needed. The OS will then take care of 
notification of an individual pool's need to grow 
through page faults and the growth becomes a process of simply committing another 
(few) pages from the pre-allocation. 

The memory requirements of variable size objects can be managed through Global Heap 
and sub-allocation APIs as now.

Thread specific local storage required for the VMI's register stacks etc. can be 
handled using Thread Local Storage APIs.

COOPERATIVE MULTIPROCESSING AND CO-ROUTINES

Some applications benefit from the programmer being able to have much tighter control 
over the scheduling of threads of execution 
within the process. The extreme example of this is co-routines, where control is
passed back and forth between two (or more) pieces 
of code as the demands of the algorithm and data dictate rather than on the basis of 
arbitrary time-slicing. Win32 has a concept of fibres: these are manually dispatched
threads. A given thread (fibre) receives its timeslice in the normal way. However,
it can transfer the rest of its current timeslice to another fibre on demand. The first
fibre remains suspended until the second fibre 
transfers control back. The state of the first fibre is retained just as it would be 
if the thread had been swapped by the scheduler. 
In this way two (or more) fibres can cooperatively share their timeslices whilst other 
threads (and processes) continue to receive 
their timeslices in the normal way. This, I think, is the essence of co-routines?

I have no understanding (yet) of how the co-routines that Parrot is promising are
implemented on *nix systems. But unless the concept of co-routines is encapsulated
at a higher level, with the implementation pushed down to platform-specific code, or
unless *nix also supports a concept very similar to Win32 fibres, it will be
impossible to utilise this "tailor-made" OS facility. Thus, an implementation of
co-routines designed to operate well on *nix platforms is likely to need to be
emulated on Win32 with high-level requirements that won't fit well with the
underlying OS. This could result in a similarly sub-optimal emulation of a *nix
concept on Win32 as currently exists for the fork emulation. Other non-*nix
platforms would likely suffer from this forced fit as well.


