Multi-cpu design

Martin Stjernholm, Roxen IS @ Pike developers forum Tue, 23 Dec 2008 17:50:14 -0800

Multi-cpu support in Pike
-------------------------

This is a draft spec for how to implement multi-cpu support in Pike.
The intention is that it gets extended along the way as more issues
gets ironed out. Discussions take place in "Pike dev" in LysKOM or
[email protected].

Initial draft created 8 Nov 2008 by Martin Stjernholm.

Background and goals

Pike supports multiple threads, but like many other high-level
languages it only allows one thread at a time to access the data
structures. This means that the utilization of multi-cpu and
multi-core systems remains low, even though there are some modules
that can do isolated computational tasks in parallell (e.g. the Image
module).

It is the so-called "interpreter lock" that must be locked to access
any reference variable (i.e. everything except floats and native
integers). This lock is held by default in essentially all C code and
is explicitly unlocked in a region by the THREADS_ALLOW/
THREADS_DISALLOW macros. On the pike level, the lock is always held -
no pike variable can be accessed and no pike function can be called
otherwise.

The purpose of the multi-cpu support is to rectify this. The design
goals are, in order of importance:

1. Pike threads should be able to execute pike code concurrently on
multiple cpus as long as they only modify thread local pike data
and read a shared pool of static data (i.e. the pike programs,
modules and constants).

2. There should be as few internal hot spots as possible (preferably
none) when pike code is executed concurrently. Care must be taken
to avoid internal synchronization, or updates of shared data that
would cause "cache line ping-pong" between cpus.

3. The concurrency should be transparent on the pike level. Pike code
should still be able to access shared data without locking and
without risking low-level inconsistencies. (So Thread.Mutex etc
would still be necessary to achieve higher level synchronization.)

4. There should be tools on the pike level to allow further
performance tuning, e.g. lock-free queues, concurrent access hash
tables, and the possibility to lock different regions of shared
data separately. These tools should be designed so that they are
easy to slot into existing code with few changes.

5. There should be tools to monitor and debug concurrency. It should
be possible to make assertions that certain objects aren't shared,
and that certain access patterns don't cause thread
synchronization. This is especially important if goal (3) is
realized, since the pike code by itself won't show what is shared
and what is thread local.

6. C modules should continue to work without source level
modification (but likely without allowing any kind of
concurrency).

Note that even if goal (3) is accomplished, this is no miracle cure
that would make all multithreaded pike programs run with optimal
efficiency on multiple cpus. One could expect better concurrency in
old code without adaptions, but it could still be hampered
considerably by e.g. frequent updates to shared data. Concurrency is a
problem that must be taken into account on all levels.

Other languages

Perl: All data is thread local by default. Data can be explicitly
shared, in which case Perl ensures internal consistency. Every shared
variable is apparently locked individually. Referencing a thread local
variable from a shared one causes the thread to die. See perthrtut(1).

Python: Afaik it's the same state of affairs as Pike.

Solution overview

The basic approach is to divide all data into thread local and shared:

o Thread local data is everything that is accessible to one thread
only, i.e. there are no references to anything in it from shared
data or from any other thread. This is typically data that the
current thread has created itself and only reference from the
stack. The thread can access its local data without locking.

o Shared data is everything that is accessible from more than one
thread. Access to it is synchronized using a global read/write
lock, the so-called "global lock". I.e. this lock can either be
locked for reading by many threads, or be locked by a single thread
for writing. Locking the global lock for writing is the same as
locking the interpreter lock in current pikes. (This single lock is
refined later - see issue "Lock spaces".)

o There is also a special case where data can be "disowned", i.e. not
shared and not local in any thread. This is used in e.g.
Thread.Queue for the objects that is in transit between threads.
Disowned data cannot have arbitrary references to it - it must
always be under the control of some object that in some way ensures
consistency. (Garbage could be made disowned since it by definition
no longer is accessible from anywhere, but of course it is always
better to clean it up instead.)

The principal use case for this model is that threads can do most of
their work with local data and read access to the shared data, and
comparatively seldom require the global write lock to update the
shared data. Every shared thing does not have its own lock since that
would cause excessive lock overhead.

Note that the shared data is typically the same as the data referenced
from the common environment (i.e. the "global data").

Also note that the current object (this) always is shared in pike
modules, so a thread cannot assume free access to it. In other pike
classes it would often be shared too, but it is still important to
utilize the situation when it is thread local.

A thread local thing, and all the things it references directly or
indirectly, automatically becomes shared whenever it gets referenced
from a shared thing.

A shared thing never automatically becomes thread local, but there is
a function to explicitly "take" it. It would first have to make sure
there are no references to it from shared or other thread local
things. Thread.Queue has a special case so that if a thread local
thing with no other refs is enqueued, it is disowned by the current
thread, and later becomes thread local in the thread that dequeues it.

Issue: Lock spaces

Having a single global read/write lock for all shared data could
become a bottleneck. Thus there is a need for shared data with locks
separate from the global lock. Things that share a common lock is
called a "lock space", and it is always possible to look up the lock
that governs any given thing (see issue "Memory object structure").

A special global lock space, which corresponds to the shared data
discussed above, is created on startup. All others have to be created
explicitly.

The intended use case for lock spaces is a "moderately large"
collection of things: Too large and you get outlocking problems, too
small and the lock overhead (both execution- and memorywise) gets
prohibiting. A typical lock space could be a RAM cache consisting of a
mapping and all its content.

Many different varieties of lock space locks can be considered, e.g. a
simple exclusive access mutex lock or a read/write lock, priority
locks, locks that ensure fairness, etc. Therefore different (C-level)
implementations should be allowed.

One important characteristic of lock space locks is whether they are
implicit or explicit:

Implicit locks are locked internally, without intervention on the pike
level. The lock duration is unspecified; locks are only acquired to
ensure internal consistency. All low level data access functions check
whether the lock space for the accessed thing is locked already. If it
isn't then the lock is acquired automatically. All implicit locks have
a well defined lock order (by pointer comparison), and since they only
are taken to guarantee internal consistency, an access function can
always free a lock to ensure correct order (see also issue "Lock
space locking").

Explicit locks are exposed to the pike level and must be locked in a
similar way to Thread.Mutex. If a low level data access function
encounters an explicit lock that isn't locked, it throws an error.
Thus it is left to the pike programmer to avoid deadlocks, but the
pike core won't cause any by itself. Since the pike core keeps track
which lock governs which thing it ensures that no lock violating
access occurs, which is a valuable aid to ensure correctness.

One can also consider a variant with a read/write lock space lock that
is implicit for read but explicit for write, thus combining atomic
pike-level updates with the convenience of implicit locking for read
access.

The scope of a lock space lock is (at least) the state inside all the
things it contains, but not the set of things itself, i.e. things
might be added to a lock space without holding a write lock (provided
the memory structure allows it). Removing a thing from a lock space
always requires the write lock since that is necessary to ensure that
a lock actually governs a thing for as long as it is held (regardless
it's for reading or writing).

FIXME: Allow removing garbage from a lock space without the write
lock?

See also issues "Memory object structure" and "Lock space locking" for
more details.

Issue: Memory object structure

Of concern are the refcounted memory objects known to the gc. They are
called "things", to avoid confusion with "objects" which are the
structs for pike objects.

There are three types of things:

o First class things with ref counter, lock space pointer, and
double-linked list pointers (to be able to visit all things in
memory, regardless of other references). Most pike visible types
are first class things. The exceptions are ints and floats, which
are passed by value, and strings and types.

o Second class things with ref counter and lock space pointer but no
double-linked list pointers. These are always reached through
pointers from one or more first class things. It's the job of the
visit functions for those first class things to ensure that the gc
visits these, thus they don't need the double-linked list pointers.
Only strings and types are likely to be of this type.

o Third class things contain only a ref counter. They are similar to
second class except that their lock spaces are implicit from the
referencing things, which means all those things must always be in
the same lock space.

Thread local things could have NULL as lock space pointer, but as a
debug measure they could also point to the thread object so that it's
possible to detect bugs with a thread accessing things local to
another thread.

Before the multi-cpu architecture, all first class things are linked
into the same global double-linked lists (one for each type: array,
mapping, multiset, object, and program). This gets split into one set
of double-linked lists for each thread and for each lock space. That
allows things to be added and removed to a thread or lock space
without requiring other locks (a lock-free double-linked list is
apparently difficult to accomplish). It also allows the gc to do
garbage collection locally in each thread and in each lock space
(although cyclic structures over several lock spaces won't be freed
that way).

A global lock-free hash table (see issue "Lock-free hash table") is
used to keep track of all lock space lock objects, and hence all
things they contain in their double-linked lists.

+----------+ +----------+
| Thread 1 | | Thread 2 |
+----------+ +----------+
// \\ // \\ // \\ // \\
,--- O O O O ,------------- O O O O ---.
| \\ // \\ // | \\ // \\ // |
ref | O O -. | ref O O | ref
| | | |
v refs ref | v v
O <----- O `--> O O O O
// \\ // \\ // \\ // \\ refs // \\ // \\
O O -> O O O O O O <----> O O O O
\\ // \\ // \\ // \\ // \\ // \\ //
+--------------+ +--------------+ +--------------+
| Lock space 1 | | Lock space 2 | | Lock space 3 |
+--------------+ +--------------+ +--------------+
^________ ^ ____^
| | |
+-----------------------+-|-+-----+-|-+-------+-|-+-----------------
| | X | | X | | X | ...
+-----------------------+---+-----+---+-------+---+-----------------

Figure 2: "Space Invaders". The O's represent things, and the \\ and
// represent the double-linked lists. Some examples of references
between things are included, and at the bottom is the global hash
table with pointers to all lock spaces.

Accessing a lock space lock structure from the global hash table
requires a hazard pointer (c.f. issue "Hazard pointers"). Accessing it
from a thing is safe if the thread controls at least one ref to the
thing, because a lock space has to be empty to delete the lock space
lock struct.

Issue: Lock space lock semantics

There are three types of locks:

o A read-safe lock ensures only that the data is consistent, not that
it stays constant. This allows lock-free updates in things where
possible (which could include arrays, mappings, and maybe even
multisets and objects of selected classes).

o A read-constant lock ensures both consistency and constantness
(i.e. what usually is assumed for a read-only lock).

o A write lock ensures complete exclusive access. The owning thread
can modify the data, and it can assume no other changes occur to it
(barring refcounters - see below). The owning thread can also under
limited time leave the data in inconsistent state. This is however
still limited by the calls to check_threads(), which means that the
state must be consistent again every time the evaluator callbacks
are run. See issue "Emulating the interpreter lock".

Allowing lock-free updates is attractive, so the standard read/write
lock that governs the global lock space will probably be multiple
read-safe/single write.

An exception to the lock semantics above are the reference counters in
refcounted things (c.f. issue "Refcounting and shared data"). A ref to
a thing can always be added or removed if it is certain that the thing
cannot asynchronously disappear. That means:

o Refcount changes must always be atomic, even when a write lock is
held.
o The refcount may be incremented or decremented when any kind of
read lock is held.
o The refcount may be incremented or decremented without any kind of
lock at all, provided the same thread already holds at least one
other ref to the same thing. This means another thread might hold a
write lock, but it still won't free the thing since the refcount
never can reach zero.
o A thing may be freed if its refcount is zero and a write lock is
held.

FIXME: Whether or not to free a thing if its refcount is zero and only
some kind of read lock is held is tricky. To allow that it's necessary
to have an atomic-decrement-and-get instruction (can be emulated with
CAS, though) to ensure no other thread is decrementing it and reaching
zero at the same time. Lock-free linked lists are also necessary to
make unlinking possible. Barring that, we need to figure out a policy
for scheduling frees of things reaching refcount zero during read
locks.

Issue: Lock space locking

Assuming that a thread already controls at least one ref to a thing
(so it won't be freed asynchronously), this is the locking process
before accessing it:

1. Read the lock space pointer. If it's NULL then the thing is thread
local and nothing more needs to be done.
2. Address an array containing the pointers to the lock spaces that
are already locked by the thread.
3. Search for the lock space pointer in the array. If present then
nothing more needs to be done.
4. Lock the lock space lock as appropriate. Note that this can imply
that other implicit locks that are held are unlocked to ensure
correct lock order (see issue "Lock spaces"). Then it's added to
the array.

A thread typically won't hold more than a few locks at any time (less
than ten or so), so a plain array and linear search should perform
well. For quickest possible access the array should be a static thread
local variable (c.f. issue "Thread local storage"). If the array gets
full, implicit locks in it can be released automatically to make
space. Still, a system where more arrays can be allocated and chained
on would perhaps be prudent to avoid the theoretical possibility of
running out of space for locked locks.

"Controlling" a ref means either to add one "for the stack", or
ensuring a lock on a thing that holds a ref. Note that implicit locks
might be released in step 4, so unless the thread controls a ref to
the referring thing too, it might no longer exist afterwards, and
hence the thing itself might be gone.

Since implicit locks can be released (almost) at will, they are open
for performance tuning: Too long lock durations and they'll outlock
other threads, too short and the locking overhead becomes more
significant. As a starting point, it seems reasonable to release them
at every evaluator callback call (i.e. at approximately every pike
function call and return).

Issue: Refcounting and shared data

Using the traditional refcounting on shared data could easily produce
hotspots: Some strings, shared constants, and the object instances for
pike modules are often accessed from many threads, so their refcounts
would be changed frequently from different processors.

E.g. making a single function call in a pike module requires the
refcount of the module object to be increased during the call since
there is a new reference from a pike_frame. The refcounters in the
module objects for commonly used modules like Stdio.pmod/module.pmod
could easily become hotspots.

Atomic increments and decrements are not enough to overcome this - the
memory must not be changed at all to avoid slow synchronizations
between cpu local caches.

Observation: Refcounters become hotspots primarily in globally
accessible shared data, which for the most part has a long lifetime
(i.e. programs, module objects, and constants). Otoh, they are most
valuable in short-lived data (shared or not), which would produce lots
of garbage if they were to be reaped by the gc instead.

Following this observation, the problem with refcounter hotspots can
to a large degree be mitigated by simply turning off refcounting in
the large body of practically static data in the shared runtime
environment.

A good way to do that is to extend the resolver in the master to mark
all programs it compiles, their constants, and the module objects, so
that refcounting of them is disabled. To do this, there has to be a
function similar to Pike.count_memory that can walk through a
structure recursively and mark everything in it. When those things
lose their refs, they will always become garbage that only is freed by
the gc.

Question: Is there data that is missed with this approach?

A disabled refcounter is recognized by a negative value and flagged by
setting the topmost two bits to one and the rest to zero, i.e. a value
in the middle of the negative range. That way, in case there is code
that steps the refcounter then it stays negative. (Such code is still
bad for performance and should be fixed, though.)

Disabling refcounting requires the gc to operate differently; see
issue "Garbage collection and external references".

Another alternative is to actually cease to use refcounting altogether
and instead use a generational garbage collector that can reap the
newest data quickly and frequently. Lock space local garbage
collection will also help here.

Issue: Strings

Strings are unique in Pike. This property is hard to keep if threads
have local string pools, since a thread local string might become
shared at any moment, and thus would need to be moved. Therefore the
string hash table remains global, and lock congestion is avoided with
some concurrent access hash table implementation. See issue "Lock-free
hash table".

Lock-free is a good start, but the hash function must also provide a
good even distribution to avoid hotspots. Pike currently uses an
in-house algorithm (DO_HASHMEM in pike_memory.h). Replacing it with a
more widespread and better studied alternative should be considered.
There seems to be few that are below O(n) (which DO_HASHMEM is),
though.

Issue: Types

Like strings, types are globally unique and always shared in Pike.
That means lock-free access to them is desirable, and it should also
be doable fairly easily since they are constant (except for the
refcounts which can be updated atomically). Otoh it's probably not as
vital as for strings since types typically only are built during
compilation.

Types are more or less always part of global shared data. That
suggests they should have their refcounts disabled most of the time
(see issue "Refcounting and shared data"). But again, since types
typically only get built during compilation, their refcounts probably
won't become hotspots anyway. So it looks like they could be exempt
from that rule.

Issue: Shared mapping and multiset data blocks

An interesting issue is if things like mapping/multiset data blocks
should be second or third class things (c.f. issue "Memory object
structure"). If they're third class it means copy-on-write behavior
doesn't work across lock spaces. If they're second class it means
additional overhead handling the lock spaces of the mapping data
blocks, and if a mapping data is shared between lock spaces then it
has to be in some third lock space of its own, or in the global lock
space, neither of which would be very good.

So it doesn't look like there's a better way than to botch
copy-on-write in this case.

Issue: Emulating the interpreter lock

For compatibility with old C modules, and for the _disable_threads
function, it is necessary to retain a complete lock like the current
interpretator lock. It has to lock the global area for writing, and
also stop all access to all lock spaces, since the thread local data
might refer to any lock space.

This lock is implemented as a read/write lock, which normally is held
permanently for reading by all threads. Only when a thread is waiting
to acquire the compat interpreter lock is it released as each thread
goes into check_threads().

This lock cannot wait for explicit lock space locks to be released.
Thus it can override the assumption that a lock space is safe from
tampering by holding a write lock on it. Still, it's only available
from the C level (with the exception of _disable_threads) so the
situation is not any different from the way the interpreter lock
overrides Thread.Mutex today.

Issue: Function calls

A lock on an object is almost always necessary before calling a
function in it. Therefore the central apply function (mega_apply) must
ensure an appropriate lock is taken. Which kind of lock
(read-safe/read-constant/write - see issue "Lock space lock
semantics") depends on what the function wants to do. Therefore all
object functions are extended with flags for this.

The best default is probably read-safe. Flags for no locking (for the
few special cases where the implementations actually are completely
lock-free) and for compat-interpreter-lock-locking should probably
exist as well. A compat-interpreter-lock flag is also necessary for
global functions that don't have a "this" object (aka efuns).

Having the required locking declared this way also alleviates each
function from the burden of doing the locking to access the current
storage, and it allows future compiler optimizations to minimize lock
operations.

Issue: Exceptions

"Forgotten" locks after exceptions shouldn't be a problem: Explicit
locks are handled just like today (i.e. it's up to the pike
programmer), and implicit locks can safely be released when an
exception is thrown.

One case requires attention: An old-style function that requires the
compat interpreter lock might catch an error. In that case the error
system has to ensure that lock is reacquired.

Issue: C module interface

A new add_function variant is probably added for new-style functions.
It takes bits for the flags discussed for issue "Function calls".
New-style functions can only assume free access to the current storage
according to those flags; everything else must be locked (through a
new set of macros/functions).

Accessor functions for data types (e.g. add_shared_strings,
mapping_lookup, and object_index_no_free) handles the necessary
locking internally. They will only assume that the thing is safe, i.e.
that the caller ensures the current thread controls at least one ref.

THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in
new-style functions.

There will be new GC callbacks for walking module global pointers to
things (see issue "Garbage collection and external references").

Issue: C module compatibility

Ref issue "Emulating the interpreter lock".

Ref issue "Garbage collection and external references".

Issue: Garbage collection and external references

The current gc design is that there is an initial "check" pass that
determines external references by counting all internal references,
and then for each thing subtract it from its refcount. If the result
isn't zero then there are external references (e.g. from global C
variables or from the C stack) and the thing is not garbage.

Since refcounting can be disabled in some objects (see issue
"Refcounting and shared data"), this approach no longer work; the gc
has to be changed to find external references some other way:

References from global C variables are few, so they can be dealt with
by requiring C modules and the core parts to provide callbacks that
lets the gc walk through them (see issue "C module interface"). This
is however not compatible with old C modules.

References from C stacks are common, and it is infeasible to require
callbacks that keep track of them. The gc instead has to scan the C
stacks for the threads and treat any aligned machine word containing
an apparently valid pointer to a known thing as an external reference.
This is the common approach used by standalone gc libraries that don't
require application support. For reference, here is one such garbage
collector, written in C++:
http://developer.apple.com/DOCUMENTATION/Cocoa/Conceptual/GarbageCollection/Introduction.html#//apple_ref/doc/uid/TP40002427
Its source is here:
http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/

The same approach is also necessary to cope with old C modules (see
issue "C module compatibility"), but since global C level pointers are
few, it might not be mandatory to get this working.

Btw, using this approach to find external refs should be considerably
more efficient than the old "check" pass, even if C stacks are scanned
wholesale.

Issue: Local garbage collection

Each thread periodically invokes a gc that only looks for garbage in
the local data of that thread. This can naturally be done without
disturbing the other threads. It follows that this gc also can be
disabled on a per-thread basis. This is a reason for keeping thread
local data in separate double-linked lists (see issue "Memory object
structure").

Similarly, if gc statistics are added to each lock space, they could
also be gc'd for internal garbage at appropriate times when they get
write locked by some thread. That might be interesting since known
cyclic structures could then be put in lock spaces of their own and be
gc'd efficiently without a global gc. Note that a global gc is still
required to clean up cycles with things in more than one lock space.

Issue: Global pike level caches

Issue: Thread.Queue

A lock-free implementation should be used. The things in the queue are
typically disowned to allow them to become thread local in the reading
thread.

Issue: False sharing

False sharing occurs when thread local things used frequently by
different threads are next to each other so that they share the same
cache line. Thus the cpu caches might force frequent resynchronization
of the cache line even though there is no apparent hotspot problem on
the C level.

This can be a problem in particular for all the block_alloc pools
containing small structs. Using thread local pools is seldom a
workable solution since most thread local structs might become shared
later on.

One way to avoid it is to add padding (and alignment). Cache line
sizes are usually 64 bytes or less (at least for Intel ia32). That
should be small enough to make this viable in many cases.

FIXME: Check cache line sizes on the other important architectures.

Worth noting that the problem is greatest for the frequently changed
ref counters at the start of each thing, so the most important thing
is to keep ref counters separated. I.e. things larger than a cache
line can probably be packed without padding.

Another way is to move things when they get shared, but that is pretty
complicated and slow.

Issue: Malloc and block_alloc

Standard OS mallocs are usually locking. Bundling a lock-free one
could be important. FIXME: Survey free implementations.

Block_alloc is a simple homebrew memory manager used in several
different places to allocate fixed-size blocks. The block_alloc pools
are often shared, so they must allow efficient concurrent access. With
a modern malloc, it is possible that the need for block_alloc is gone,
or perhaps the malloc lib has builtin support for fixed-size pools.
Making a lock-free implementation is nontrivial, so the homebrew ought
to be ditched in any case.

A problem with ditching block_alloc is that there is some code that
walks through all allocated blocks in a pool, and also avoids garbage
by freeing the whole pool altogether. FIXME: Investigate alternatives
here.

Multi-cpu design

Reply via email to