This is going to be extremely light on details with respect to the current state of the Parrot interpreter.
It is also going to be expressed in terms of Win32 APIs. For both of these I apologise in advance. Time, and the "or forever hold your peace" imperative, have overridden my desire to do otherwise. My attempts to get up to speed enough on the sources to work out how to apply the ideas I am going to express, and to become conversant with the *nix pthreads implementation and terminology, have moved too slowly to afford me the opportunity to do more.

Hoping this stimulates discussion rather than a flame war.

Regards, Nigel Sandever.

PS. Sorry if this gets posted twice.

------------------------

THE BASIS OF THE IDEA

Modern OSs succeed in having multiple threads of execution share a single copy of process memory without the operations of one thread being able to interfere with the state of another. The state of the code running in those threads may be corrupted through mismanagement, but the state of the threads themselves (their program counters, registers and stacks) cannot.

The mechanisms for this incorruptibility are: each operation (opcode) performed by a thread is atomic; the scheduler can never interrupt a thread whilst an operation is in progress, only between operations. Before the start of an operation, and after the end of one, the state of the thread is entirely encapsulated within its registers and stack. By swapping the entire CPU register set when switching from one thread to another, the state of each thread is preserved and reconstituted. No other mechanisms or interlocks are required.

By analogy, a virtual machine that wishes to have multiple threads of execution must achieve the same level of atomicity for each operation it performs.

VIRTUAL MACHINE INTERPRETER

At any given point in the running of the interpreter, the VM register set, program counter and stack must represent the entire state of that thread.
Once an opcode has started execution on a given thread, no other thread of execution within that interpreter may be allowed to start an operation until the first thread has completed its opcode.

NON-VMI THREADS

ASYNCHRONOUS IO

Note that this does not mean that no other thread in the process can take a timeslice, only that any thread that is allowed to run must not be able to affect the state of the VM in question. A thread charged with performing asynchronous reads on behalf of the user program running within the VM interpreter can go ahead, so long as it doesn't directly modify the VMI state.

EVENT MANAGER THREAD

Equally, an event thread can run concurrently in the background to receive asynchronous notifications (signals, messages, asynchronous read completions etc.). It can then queue these events and set a flag that the VMI inspects between each iteration of the "opcode execution loop", acting upon it appropriately. This gives the benefit of "safe signals", along with safe and timely processing of other, similarly asynchronous events.

GARBAGE COLLECTION

The garbage collector would need to run *synchronously* with the interpreter opcode loop. Effectively, it would be another (possibly long-running) opcode. An analogy for this is the long-running syscalls that operate within a multi-tasking OS, e.g. synchronous IO, virtual memory allocation, swapping etc. Just as virtual memory operations suspend the affected processes until the operations are complete, so garbage collection can be seen as a virtual memory operation for the virtual machine, one that requires the VMI to be suspended until the operation is complete.

PREREQUISITES

The key is for the VM to be reentrant, and the use of (in Win32 terms) a "critical section".

REENTRANCY

Not only must the VMI be coded in a reentrant fashion, with all state addressed through pointers (references) loaded into its virtual register set; all the code underlying it, including syscalls and the CRT, must equally be reentrant.
Many APIs within many CRTs *are not reentrant* (e.g. strtok()). All state must be kept on a per-thread, not a per-process, basis. To this end, I would propose that no CRT APIs be used directly from the main code! Instead, a full set of CRT-like macros would be inherited from a header file, where either the real CRT API would be called, or an alternative implementation. This header file would be conditionally included on the basis of the target platform. This concentrates the bulk, if not all, of the platform-specific code into a single file (or set of files).

EXPANDING THE VIRTUAL MACHINE

What I am suggesting here is that the virtual machine analogy be extended to encompass not just the VHLL-user-program view of the registers and stack, but also its view of the entire machine and underlying OS. By so doing, not only does it isolate those working on the core of the VMI from the underlying machine and OS architecture; it allows them to extend the VMI's view of the world beyond the concepts of any single architecture. It also serves the very practical purpose of avoiding a morass of platform-conditional compilation directives in the bulk of the code, by concentrating them in single (groups of) files, by platform.

ATOMICITY AND CRITICAL SECTIONS

Atomicity of the VMI opcodes can be achieved by having the core loop of the interpreter enter and exit a critical section each time it processes an opcode. For those not familiar with critical sections, they are a (Win32) OS construct that prevents any cooperating thread within a process from having a timeslice until the thread that entered the critical section has exited it. Unlike semaphores and events, critsecs only operate within a single process. They are relatively lightweight, as there is no need to enter kernel mode for their operation. They are, in effect, a CPU-specific "test and set" operation applied to a piece of user-mode memory.
Their lightweight nature means that they are faster than inter-process semaphore mechanisms. When used in a process that currently has only a single thread in operation, they have an almost negligible performance effect upon that thread. They also operate correctly on SMP machines.

PROCESS GLOBAL STATE

Unavoidably process-global state, such as filehandles etc., should only be usable (from the user program being run by the VMI) through the use of opcodes. Done this way, the serialisation of the opcodes at the VMI level protects the global state.

THIS IS ONLY PART OF THE SOLUTION

None of the above addresses the problem of synchronising user program state. This will still need to be addressed. It could either be done by the user code, through the use of something like the P5 :shared and lock() mechanisms, or, preferably, transparently by the VMI itself. Possibly, by having each PMC contain an 'in-use' flag that would be set each time an opcode (method) was dispatched on a PMC (object), and cleared when the operation completes. If two threads attempt concurrent operations on the same PMC, the first thread accessing the PMC sets the flag. When a second thread attempts to access it, it finds the flag set and blocks (relinquishing its timeslice) until the first thread has completed and cleared the flag.

This doesn't solve the potential deadlock problems arising from user-level code, though it may be possible for the VMI to detect and diagnose these at runtime, in a similar way to how deep-recursion detection is done in P5. The cost to a single-threaded user application is a test-and-set operation (a few cycles) per object-method, plus the flag itself, which need only be one bit, but may be more efficiently coded as a 32-bit entity.

CONCLUSION (TENTATIVE, NO PROOF)

As all VHLL entities are PMCs at the VMI level, the sharing of data between threads at the VHLL level is done entirely through those PMCs.
If no single PMC can be the subject of an opcode on two threads concurrently, there /should/ be no opportunity for conflict. As all VMI internal state is encapsulated within the VMI register set, and each thread has its own set of registers, the internal state of the VMI(s) should be equally secure. Other internal housekeeping operations (memory allocation, garbage collection etc.) are performed as "sysopcodes", executed by the VMI within the auspices of the critical section, and thus secured. All asynchronous operations are performed by one or more non-VMI threads and do not adjust the state of the VMI directly. Instead, they queue notifications of events and results to the VMI, and these are detected by the VMI within the body of the main loop. Once detected, an appropriate sysopcode is dispatched within the critical section in the normal way.

SOME ADDITIONAL IDEAS

MEMORY MANAGEMENT

I've been looking at (glancing through) the memory management used by Parrot. The basic concept of memory pools, if I have understood them correctly, is to allocate similarly sized objects from the same pool. The idea behind this is that it will be easier to find a suitably sized piece of memory available for reallocation, having been GC'd, from a pool dedicated to allocations of a similar size to that needed. This should reduce the fragmentation of originally large allocations into lots of small allocations, thereby reducing the need for coalescence and overall memory allocation.

The thought struck me that, with the exception of strings and (other) compound data types, all the instances of any given class are always the same size. Where an existing class is dynamically extended in some way, the result is the creation of a new (anonymous) class, as pre-existing instances of the original class would not be modified. As such, the class would seem to be the ideal point at which to pool the allocation and deallocation of memory.
If each class is given its own memory pool that is a multiple of its instance size, that pool effectively becomes a (C-style) array of instances. As each element is exactly the right size, the only time the pool would need to grow is when all the existing elements are in use and another is required. Once an instance has been freed, its slot in the array would be available, and exactly the right size, for the next instantiation. There would never be a requirement to coalesce allocations. Whether the free slots are found through a free-space chain or even a linear search, the GC would only need to process this pool when a new instance of the class is being allocated. It would then only need to scan a single pool of (same-sized) reference objects to GC the existing instances. If all references to this type were allocated from another pool dedicated to references-to-this-class, which also hung off the parent class, then the volume of references to be scanned would also be minimised.

If the larger segments of memory allocated for each pool were allocated as virtual memory segments, rather than as arbitrary chunks of heap memory, the size of each individual pool could be grown in place at runtime, without the need to copy the existing space to a new, larger allocation, and without the need to shuffle any other memory pools around to accommodate the increase in size. The OS's virtual memory management takes care of the whole process. It also becomes possible to reserve a sizeable pre-allocation for important pools, and for those known to be likely to grow quickly, without actually consuming the entire reservation from the physical memory pool before it is needed. The OS will then take care of notifying an individual pool of its need to grow through page faults, and the growth becomes a process of simply committing another (few) pages from the pre-allocation.
The memory requirements of variable-size objects can be managed through the global heap and sub-allocation APIs, as now. The thread-specific local storage required for the VMI's register stacks etc. can be handled using Thread Local Storage APIs.

COOPERATIVE MULTIPROCESSING AND CO-ROUTINES

Some applications benefit from the programmer being able to have much tighter control over the scheduling of threads of execution within the process. The extreme example of this is co-routines, where control is passed back and forth between two (or more) pieces of code as the demands of the algorithm and data dictate, rather than on the basis of arbitrary time-slicing.

Win32 has a concept of fibres. They are manually dispatched threads. A given thread (fibre) receives its timeslice in the normal way. However, it can transfer the rest of its current timeslice to another fibre on demand. The first fibre remains suspended until the second fibre transfers control back. The state of the first fibre is retained just as it would be if the thread had been swapped out by the scheduler. In this way, two (or more) fibres can cooperatively share their timeslices whilst other threads (and processes) continue to receive their timeslices in the normal way. This, I think, is the essence of co-routines?

I have no understanding (yet) of how the co-routines that Parrot is promising are implemented on *nix systems. But unless the concept of co-routines is encapsulated at a higher level, with the implementation being pushed down to platform-specific code, and unless *nix also supports a concept very similar to Win32 fibres, it will be impossible to utilise this "tailor made" OS facility. Thus, an implementation of co-routines designed to operate well on *nix platforms is likely to need to be emulated on Win32, with high-level requirements that won't fit well with the underlying OS. This could result in a similarly sub-optimal emulation of a *nix concept on Win32 as currently exists for the fork emulation.
Other non-*nix platforms would likely suffer from this force-fit as well.