Author: allison
Date: Tue Nov 20 07:41:38 2007
New Revision: 22910

Modified:
   trunk/docs/pdds/draft/pdd25_concurrency.pod

Log:
[pdd] Substantial update of the concurrency PDD (25) to reflect current design.


Modified: trunk/docs/pdds/draft/pdd25_concurrency.pod
==============================================================================
--- trunk/docs/pdds/draft/pdd25_concurrency.pod (original)
+++ trunk/docs/pdds/draft/pdd25_concurrency.pod Tue Nov 20 07:41:38 2007
@@ -14,17 +14,6 @@
 
 $Revision$
 
-=head1 DEFINITIONS
-
-I<Concurrency> is a parallel execution of units of code (on
-multiprocessor machines), or a flexible ordering of units of code (on
-single processor machines). For certain problem spaces, concurrency
-offers significant speed gains by parceling out processor-intensive
-activity or by ensuring that a wait for input or system resources
-doesn't hold up the entire application.
-
-A I<Task> is a unit of code that can be executed in parallel.
-
 =head1 DESCRIPTION
 
 =over 4
@@ -32,969 +21,298 @@
 =item - Parrot supports multiple concurrency models, including POSIX
 threads, event-based programming, and asynchronous I/O.
 
-=item - To reduce conflicts between concurrency models, Parrot provides
-a single central concurrency scheduler for each interpreter instance.
-Each concurrency model defines a Task PMC that supports a standard
-minimal interface. The scheduler can interact with tasks from various
-models without direct access to the details of each model.
-
-=item - On multiprocessor systems, the scheduler is responsible for
-allocating tasks to processors, or for delegating that allocation to the
-underlying OS.
+=item - A concurrency scheduler manages all concurrent tasks.
 
+=item - Each interpreter has its own concurrency scheduler.
 
-=back
+=item - Concurrency schedulers for different interpreters
+communicate and can share tasks.
 
-=head1 DEFINITIONS
+=item - A concurrency scheduler may link to other schedulers as a
+parent, a child, or an equal.
 
-So we can all talk about things the same way, the following
-definitons apply. Some of these are drawn from the POSIX thread spec,
-and as such we should have a translation section at the end.
+=item - A task is a concurrent unit of work.
 
-=over 4
-
-=item THREAD
-
-An OS level thread. If that makes no sense, neither will any of the
-rest of this, in which case I recommend picking up "Programming with
-POSIX Threads" by Dave Butenhof, and coming back when you have.
+=item - All tasks support a standard interface used by the concurrency
+scheduler, but otherwise have a great deal of flexibility in their
+implementation.
 
-=item MUTEX
+=item - Tasks can share PMC variables.
 
-This is a low level, under the hood, not exposed to users, thing that
-can be locked. They're non-recursive, non-read/write, exclusive
-things. When a thread gets a mutex, any other attempt to get that
-mutex will block until the owning thread releases the mutex. The
-platform-native lock construct will be used for this.
+=back
 
-{{ - Nigel Sandever: Will this be macroised to hide the platform native
-implementation from the main body of the code?
-- Dan: Yes. }}
+=head1 DEFINITIONS
 
-=item LOCK
+Concurrency is a parallel execution of units of code (on
+multiprocessor machines), or a flexible ordering of units of code (on
+single processor machines). For certain problem spaces, concurrency
+offers significant speed gains by parceling out processor-intensive
+activity or by ensuring that a wait for input or system resources
+doesn't hold up the entire application.
 
-This is an exposed-to-HLL-code thing that can be locked. Only PMCs can
-be locked, and the lock may or may not be recursive or read/write.
+A task is a unit of code that can be executed in parallel.
 
-=item CONDITION VARIABLE
+=head1 IMPLEMENTATION
 
-The "sleep until something pings me" construct. Useful for queue
-construction. Not conceptually associated with a MUTEX. (POSIX
-threads require this, so we're going to hide it there behind macros
-and/or functions)
+Rather than defining a single canonical threading model, Parrot defines
+an infrastructure that supports multiple concurrency models and provides
+for interaction between the various models. Parrot already uses multiple
+concurrency models for events, threads, async I/O, and exceptions, a
+trend that will only continue as we support multiple HLLs and external
+threading libraries like Intel's Threading Building Blocks. Designing
+for multiple concurrency models also gives Parrot more room to grow as
+future models are researched and developed.
+
+To avoid conflicts between concurrency models, Parrot provides a single
+central concurrency scheduler for each interpreter instance.  Each
+concurrency model defines a Task PMC that supports a standard minimal
+interface. The scheduler can interact with tasks from different models
+without direct access to the details of each model.
+
+On multiprocessor systems, the scheduler is responsible for allocating
+tasks to processors, or for delegating that allocation to the underlying
+OS.
+
+For the most part, when we talk about concurrency, we mean concurrency
+across an interpreter pool. An interpreter pool is a set of interpreter
+instances that share common resources: the memory pools, arenas, and
+global namespace--pretty much everything except what's in the
+interpreter structure itself. They're essentially threads in the OS
+sense. 
+
+Another form of concurrency is between completely independent
+interpreter instances, each with its own memory pools, arenas,
+namespaces, etc. Independent interpreters may run as separate processes
+on the same machine, or even as separate processes on different machines
+(in a clustering environment, for example). The concerns of
+shared-interpreter concurrency and independent-interpreter concurrency
+are similar, and in Parrot both use the same central concurrency
+scheduler. This PDD doesn't directly address independent-interpreter
+concurrency, but does include occasional notes on how it integrates with
+shared-interpreter concurrency.
+
+=head2 Supported Concurrency Models
+
+The following are a few of the concurrency models Parrot intends to
+support. The biggest differences between them are in how they handle
+variables shared across concurrent tasks, but the design allows the
+central scheduler to manage tasks from all of these models through a
+common interface.
+
+=head3 Mutex/Lock Concurrency
+
+In this model, before performing an operation on a shared variable, you
+must first acquire a lock on it. Once a lock is acquired, any other
+concurrent tasks that attempt to acquire a lock on that variable will
+block until the existing lock is released.
+
+A mutex is a low-level lock construct that is not directly exposed to
+users. Mutexes are non-recursive, non-read/write, and exclusive: when a
+concurrent task acquires a mutex, any other attempt to acquire that
+mutex will block until the owning task releases it. Mutexes are
+implemented using the platform-native lock construct.
 
-{{ - Nigel Sandever: Could you elaborate on the nature of what would constitute
-a "ping"?
-- Dan: POSIX has a function cond_signal (and cond_broadcast, which is
-similar). What happens is a thread grabs the condition variable and
-goes to sleep, and sleeps until another thread does a cond_signal,
-which then wakes it up. If there are multiple threads sleeping on the
-condition variable, only one is woken. (cond_broadcast wakes them all
-up) }}
+The first thing that any vtable function of a shared PMC must do is
+acquire the mutexes of the PMCs in its parameter list (in ascending
+address order). In this model, only PMCs can be shared.
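+
+As a minimal sketch (not Parrot's actual code) of this lock-ordering
+discipline, the following acquires the mutexes of two shared PMCs in
+ascending address order, using POSIX mutexes and a stand-in PMC struct:
+
+  #include <pthread.h>
+  #include <stdint.h>
+
+  typedef struct PMC { pthread_mutex_t mutex; /* ... */ } PMC;
+
+  /* Lock both PMCs' mutexes, lowest address first, so that two
+   * concurrent vtable calls can never deadlock on each other. */
+  static void lock_pmc_pair(PMC *a, PMC *b) {
+      PMC *lo = ((uintptr_t)a < (uintptr_t)b) ? a : b;
+      PMC *hi = (lo == a) ? b : a;
+      pthread_mutex_lock(&lo->mutex);
+      if (hi != lo)             /* the same PMC passed twice */
+          pthread_mutex_lock(&hi->mutex);
+  }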
 
-=item RENDEZVOUS POINT
+=head3 STM Concurrency
 
-A HLL version of a condition variable.
+Parrot's preferred model of concurrency is based on Software
+Transactional Memory. In this model, rather than locking a shared
+variable while performing a series of operations on it, a task bundles
+the changes into a transaction that acts as an atomic unit.
+
+Within the transaction, STM creates a "hypothetical" copy of the
+variable, logs the changes made to that copy, and at the end of the
+transaction performs some validation steps to decide whether to save the
+hypothetical value back to the real variable (a commit) or discard the
+hypothetical value (a rollback). One common validation step is to check
+whether the value of the real variable was changed during the execution
+of the transaction (possibly by another concurrent task).
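+
+The following is a schematic C sketch of that validation step, assuming
+a hypothetical versioned shared variable; a real implementation would
+also make the validate-and-commit step itself atomic:
+
+  #include <stdbool.h>
+
+  typedef struct {
+      int      value;    /* the real, shared variable */
+      unsigned version;  /* bumped on every committed write */
+  } SharedVar;
+
+  typedef struct {
+      SharedVar *var;          /* variable touched by the transaction */
+      unsigned   seen;         /* version observed when copied */
+      int        hypothetical; /* the transaction's private copy */
+  } Txn;
+
+  /* Commit if the real variable is unchanged since the transaction
+   * copied it; otherwise report failure so the caller can discard
+   * the hypothetical value (a rollback) and retry. */
+  static bool txn_commit(Txn *t) {
+      if (t->var->version != t->seen)
+          return false;                 /* validation failed */
+      t->var->value = t->hypothetical;  /* save the hypothetical value */
+      t->var->version++;
+      return true;
+  }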
+
+STM tasks can read and write shared variables used by mutex/lock tasks,
+since each transaction appears to the mutex/lock task as a single
+atomic operation. Mutex/lock
+tasks can read shared variables from STM tasks, but they cannot write
+them, as the STM tasks will not respect the lock and may commit a new
+value in the middle of a complex operation that requires the lock. As a
+safety mode, STM tasks may be configured to fail validation on any
+transaction attempting to commit to a variable locked by a mutex/lock
+task.
+
+=head3 POSIX Concurrency
+
+This is the POSIX "share-everything" style of threading, such as is used
+in Perl 5's "pthread" model, as well as the thread models for Ruby and
+Python. [Recommended reading: "Programming with POSIX Threads" by Dave
+Butenhof.]
+
+=head3 Process-type Concurrency
+
+This is the Perl 5 "iThreads" threading model. In this model no data is
+shared implicitly; all sharing must be deliberate and explicit. It
+resembles the Unix fork-process-with-shared-memory-segment model, which
+is no surprise, as it was originally developed to emulate Unix's fork
+system.
+
+=head3 Independent Concurrency
+
+Independent tasks have no contact with the internal data of any other
+task in the current process. They are implemented as STM tasks, but use
+transactions only for the shared interpreter globals.
 
-=item INTERPRETER
+Note that independent tasks may still communicate back and forth by
+passing either atomic values (ints, floats, and pointers) or static
+buffers whose ownership passes to the destination task.
 
-Those bits of the Parrot_Interp structure that are absolutely required
-to be thread-specific. This includes the current register sets and
-stack pointers, as well as security context information. Basically if
-a continuation captures it, it's the interpreter.
+=head3 Intel Threading Building Blocks
 
-=item INTERPRETER ENVIRONMENT
+Threading Building Blocks (TBB) is a library of tools for data-parallel
+programming, dividing large data sets into small pieces so that
+operations on those data sets can be parallelized across multiple
+processors.
+
+Parrot will provide two levels of integration with TBB: an interface for
+TBB's scheduling to interact with the central concurrency scheduler, and
+an interface for developers to access the TBB routines from within
+PIR/PASM.
+
+Like Parrot, TBB is task-based. Since TBB performs its own scheduling,
+TBB tasks in Parrot will be given a lightweight scheduler that only has
+the responsibility of passing messages, events, etc., back and forth
+between the TBB task and the central scheduler. TBB tasks will not share
+variables with any other types of concurrent tasks in Parrot.
+
+Note that since TBB is a C++ library, it is only available when Parrot
+is compiled with a C++ compiler.
+
+=head2 Concurrency Scheduler API
+
+The concurrency scheduler has two parts: a Scheduler PMC, an instance of
+which is stored in the interpreter struct, and a set of core routines in
+F<src/scheduler.c>.
 
-Those bits of the Parrot_Interp structure that aren't required to be
-thread-specific (though I'm not sure there are any) I<PLUS> anything
-pointed to that doesn't have to be thread-specific.
+An instance of the Scheduler PMC has three internal attributes:
 
-The environment includes the global namespaces, pads, stack chunks,
-memory allocation regions, arenas, and whatnots. Just because the
-pointer to the current pad is thread-specific doesn't mean the pad
-I<itself> has to be. It can be shared.
+=over 4
 
-=item INDEPENDENT THREAD
+=item 1
 
-A thread that has no contact I<AT ALL> with the internal data of any
-other thread in the current process. Independent threads need no
-synchronization for anything other than what few global things we
-have. And the fewer the better, though alas we can't have none at all.
+The task list
 
-Note that independent threads may still communicate back and forth by
-passing either atomic things (ints, floats, and pointers) or static
-buffers that can become the property of the destination thread.
+=item 2
 
-=item SHARED THREAD
+The task rank index
 
-A thread that's part of a group of threads sharing a common
-interpreter environment.
+=item 3
 
-{{ - Leo: I presume that this "group of threads" is one thread pool or
-interpreter pool. Could you expand the definition to cover "pool". 
-- Dan: Ah, damn, I needed to fix that up before sending it out. Should've been
-"SHARED INTERPRETER". A shared interpreter is one that's in an interpreter
-pool.
-
-An interpreter pool is a set of interpreters that share common
-resources. They're essentially threads in the OS sense, and they have
-shared access to pretty much everything. The memory pools, arenas,
-and global namespace are shared--pretty much everything except what's
-in the interpreter structure itself. }}
+The list of handlers
 
 =back
 
-=head1 REQUIREMENTS
+The task list is a simple unordered, integer-indexed data structure, currently
+implemented as a hash. Each task in the list has an integer ID assigned when it
+is first inserted into the list. A task retains the same ID throughout its
+lifetime, and the ID is not reused once a task is finalized. (The data
+structure is currently implemented as a hash so that it only takes up the
+memory required for currently living tasks. A task list for a particular
+program may use thousands of task IDs, but only need memory allocated for a
+handful of elements at any given moment.)
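+
+A sketch of that ID policy, with opaque stand-ins for Parrot's hash and
+task types (the C<hash_put_int> API here is hypothetical):
+
+  typedef struct Task Task;
+  typedef struct Hash Hash;
+  extern void hash_put_int(Hash *h, unsigned key, Task *t);
+
+  /* IDs grow monotonically and are never reused, so the hash holds
+   * entries only for currently living tasks, no matter how many IDs
+   * the program burns through over its lifetime. */
+  static unsigned next_task_id = 0;
+
+  static unsigned task_list_insert(Hash *task_list, Task *task) {
+      const unsigned id = next_task_id++;
+      hash_put_int(task_list, id, task);
+      return id;
+  }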
+
+The task rank index is calculated from the type, priority rating, and
+age of the tasks in the task list. The index is a simple array, and in
+general the top (zeroth) element in the array is the next to receive
+attention. The index is recalculated regularly as new tasks are inserted
+into the task list, as existing tasks are modified or completed, and as
+time passes, since a task's increasing age can raise its priority.
+Because of the regular
+recalculation, the rank index may cache some frequently-accessed and rarely
+changing data from the tasks (though it is not required to do so). (As a later
+optimization, some data structure other than an array may be used to speed up
+rank recalculation. For example, with a hash of hashes of arrays keyed on task
+attributes, the process of inserting new tasks at a relative priority to other
+existing tasks could be performed without shifting the rank of all lower ranked
+tasks.)
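+
+The ranking itself might reduce to a comparator like the following
+sketch (the fields and weights are illustrative, not Parrot's):
+
+  typedef struct {
+      int    type_weight; /* e.g. events could outrank plain code tasks */
+      int    priority;    /* the task's priority rating */
+      double birthtime;   /* insertion time, used to derive age */
+  } RankInfo;
+
+  /* Higher type weight and priority sort toward index zero; among
+   * equals, older tasks (smaller birthtime) come first, so aging
+   * tasks eventually rise to the top of the index. */
+  static int rank_cmp(const RankInfo *a, const RankInfo *b) {
+      if (a->type_weight != b->type_weight)
+          return b->type_weight - a->type_weight;
+      if (a->priority != b->priority)
+          return b->priority - a->priority;
+      return (a->birthtime < b->birthtime) ? -1
+           : (a->birthtime > b->birthtime) ?  1 : 0;
+  }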
+
+The list of handlers is a simple stack of handler PMCs currently waiting for an
+appropriate task (event, exception). See PDD 24 on Events for more details on
+event handlers.
 
-=head2 Supported Models
+=head2 Task PMC API
 
-=over 4
+The interface of the Task PMC is also the minimum required interface for all
+subclasses, extensions, and alternate implementations of a task.
 
-=item POSIX threads
+An instance of the Task PMC has five internal attributes:
 
-The threading scheme must be sufficient to support a POSIX
-share-everything style of threading, such as is used in perl 5's
-"pthread" model, as well as the thread models for Ruby and Python.
-
-=item "Process-type" threads
-
-The scheme must also support the perl 5 "iThreads" threading
-model. In this model no data is shared implicitly, and all sharing
-must be done on purpose and explicitly. It very much resembles the
-Unix fork-process-with-shared-memory-segment model, not a surprise as
-it was originally developed with emulation of Unix's fork system in
-mind. 
-
-{{ - Nigel Sandever: Pseudo forks?
-- Dan: Yeah. On Win32, when Perl forks it starts a new thread and clones the
-interpreter and its contents. If you later do an exec in that thread it starts
-a new process and grabs hold of it. Definitely a clever trick.  }}
+=over 4
 
-=back
+=item 1
 
-=head2 Guarantees
+The type of the task
 
-{{ -Dave Whipp: Maybe this isn't strictly a threading thing, but are we going
-to make any guarantees about event orderings? For example, will we guarantee
-that a sequence of events send from one thread to another will always be
-received in the order they are sent? (Obviously no guarantees about
-interleaving of events from other threads). This is usually only important in
-distributed environments where multiple paths exists between a pair of
-endpoints, but it would be nice for user-code to be able to rely on it. 
-- Dan: Hrm. I suppose we really ought to, so we will. If we prioritize
-events (and I'm still torn) then they'll be in priority then send
-order. 
-}}
+=item 2
 
-=over 4
+The priority of the task
 
-=item No Crashes
+=item 3
 
-The interpreter guarantees that no user program errors of any sort
-will crash the interpreter. This includes threading problems. As
-such, synchronization issues (where multiple interpreters are
-accessing the same shared data) must not crash the interpreter or
-corrupt its internal state.
+The birthtime stamp of the task
 
-=back
+=item 4
 
-=head2 Assumptions
+The code block of the task (optional)
 
-=over 4
+=item 5
 
-=item System memory allocation routines are threadsafe
+An interpreter structure for the task (optional)
 
-We are assuming that the memory allocation system of the base OS is
-threadsafe. While some of the C runtime libraries are notoriously
-thread dangerous, memory allocation code generally is threadsafe, and
-we'll assume that on all platforms. (Though we will, in general,
-manage our own memory)
-
-{{ - Nigel Sandever: Is there any likelyhood that memory allocation will be
-hidden behind macros at two levels:
- - ALLOCPOOL() for allocating large chunks of memory (ppols) that are
-   later sub-allocated (and managed) by:
- - SUBALLOC() for sub allocating within a pool
-
-- Dan: We'll generally hide behind the memory.c routines, if for no other
-reason than to allow the embedder to override the memory functions.  We need to
-define that at some point... 
-- Gordon Henriksen: Are you wanting something akin to Apache 2 pools, which are
-hierarchical and designed to reduce path length when freeing blocks of objects?
-For instance, all objects and sub-pools allocated during an HTTP request cycle
-can be deallocated just by free()'ing the top-level request pool.
-
-I don't think parrot could make use of that model, since it can't very
-well guarantee that the user cannot retain references past the lifetime
-of the pool. Apache trusts modules not to make such errors; parrot can't
-trust the byte-code it's executing any further than it can throw it. A
-generational collector is a more likely means by which parrot might
-reduce memory-related overhead. 
-- Nigel Sandever: Nothing to do with Apache memory pools.
-
-I believe that parrot already has the concept of memory pools in it's
-memory management. The idea is that by allocating similarly sized objects
-from separate (large) allocations, you can reduce the fragmentation of
-chunks and reduce the incidences where the memory need to be GC'd and
-compacted.
-
-Allocating an 8 byte chunk from a common memory pool is quite likely to
-nip a little off from a previously freed large chunk. When it comes time
-reallocate another chunk the same size as that large, freed chunk, although
-there is enough room in the over all freespace chain to accommodate it, the
-largest available chunk is now 8 bytes or so too small for the requirement.
-
-That induces either a compaction cycle or the need to extend the memory pool
-by the size of the large request.
-
-Allocating all small requests from the same pool, and large from another
-pool means that you are less likely to fragment the memory and more likely
-to be able to re-use an existing slot in the free-space chain for any
-given request.
-
-If the allocation of pools, and the allocation of bit-of-a-pool, are
-macroised, it makes it possible for OS's where there are multiple APIs
-for memory allocation to bypass the CRT memory allocation routines and
-use which ever native APis are best suited for the type of allocation.
-
-Personally, I would like to see memory allocation for each class type
-be managed by the class constructor itself. This would theoretically allow
-each class that has a fixed instance size to manage it's own pool on OS's
-where that makes sense. The class would allocate a pool for itself when
-loaded and then allocate instances from that pool on new() and deallocate
-upon DESTROY. If it's memory pool was exhausted when new was called,
-it would invoke the GC on *it's pool only*.
-
-This separation would mean that each run of the GC would have a much smaller
-pool of memory to compact and garbage collect when it was invoked. It would
-also be less likely to be called, as each allocation from a pool of fixed
-sized sub allocations will only ever need to call the GC when it's pool is
-entirely exhausted.
-
-But that is a radical departure :), so if would just like to see separate calls
-for pool allocation/reallocation and element allocation/reallocation, rather
-than having calls to malloc() scattered through out the codebase. 
-
-- Gordon Henrikson: Don't presume that [allocating similarly sized objects
-reduces fragmentation and GC]. General purpose memory allocators tend to handle
-this scenario... generically. Use of custom allocators is usually a premature
-optimization; the best generic memory allocators are very, very good.
-
-But the ultimate effect of the overuse of custom allocators or pools is
-to move memory fragmentation to a higher level, preventing the generic
-allocator from addressing it even when the memory allocation patterns
-would otherwise allow it.
-
-Would much rather see [the allocation of pools] abstracted, as they are now,
-beneath an API.  This reduces the number of ways in which compiled parrot
-extensions can be incompatible with the parrot core.
-
-A PMC class is free to [manage the memory allocation for its
-own class constructor]. Only the PMC header cannot be thus managed, and that's
-already pooled.
-
-Th[e idea that such a separation means a smaller pool to compact and garbage
-collect] is false. The mark phase will still need to run over the entire
-process, else it cannot detect all references into the pool. (Unless you can
-provide a type-safety guarantee that your type can only be referenced by other
-instances of itself. In which case, your type is necessarily "garbage" and the
-best optimization strategy would be dead-code elimination. :)
-
-- Nigel Sandever: If by reference, you mean address, then that is true.
-
-If when a reference is taken, the address of the referent is
-stored in arbitrary other data structures, then all memory
-be scanned by the GC,
-
-However, if references were not addresses, but an index into a
-table of addresses, then only the table need be scanned.
-
-When a reference, or the thing holding it, is deleted, then
-the indexed slot in the address table is blanked and subsequent
-GC passes looking for references to a referent will no longer
-find the reference in the table.
-
-With the addition of a reference count to the table, all references
-to a single entity can share the same index, but now I'm groping
-back in the direction of reference counting, which is not flavour
-of the month:) 
-
-- Gordon Henrikson: I predict the best overall throughput would be with
-the sweep or copy phase run immediately after the mark phase, across the
-entire process: It would be wasteful to do all that marking and yet
-leave known garbage uncollected.
-
-Statistically, multiple pools will individually exhaust themselves
-*MORE* frequently than a global pool. The ideal case for collection
-frequency is that there is only one pool, or that all pools rise to
-capacity concurrently. In these ideal cases, the segregated pools will
-exhaust themselves precisely *AS* frequently as a global pool. In any
-case, there is no possibility for a decrease in collection events
-through the use of pooled memory. As per above, collection events will
-also not become less expensive. Thus, expect garbage collection to have
-a negative net impact on performance if pooled memory is used. 
-
-- Leo: The problem is not in the fixed sized header pools, its with the
-*memory* pool used e.g for string memory.
-
-During GC we are walking the *header* pools, and if we find its buffer
-memory being used, we move that buffer memory to a new store, thereby
-compacting it. The old memory store(s) are freed then.
+=back
 
-So string memory can move around beyond your code. 
-}}
+Types of tasks include 'event', 'exception', 'io', and 'code'.
 
-=back
+The birthtime stamp is the point at which the task was inserted into the task
+list, and is used for calculating the age of tasks.
 
-=head1 Proposal
+The code block is optional and only for tasks that are associated with a simple
+code block. The interpreter structure is also optional and only used for
+thread-like tasks that maintain their own interpreter state.
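+
+As a C struct, the five attributes might be laid out as in this sketch
+(the field names are illustrative, not Parrot's actual layout):
+
+  typedef struct PMC PMC;                     /* opaque stand-ins */
+  typedef struct Parrot_Interp Parrot_Interp;
+
+  typedef struct Task_attributes {
+      const char    *type;      /* 'event', 'exception', 'io', 'code' */
+      int            priority;  /* priority rating for the rank index */
+      double         birthtime; /* when the task entered the task list */
+      PMC           *codeblock; /* optional: simple code block to run */
+      Parrot_Interp *interp;    /* optional: interpreter state for
+                                   thread-like tasks */
+  } Task_attributes;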
 
-The proposal is as follows:
+=head2 Opcodes
 
 =over 4
 
-=item All global state shall be protected by mutexes
-
-Straightforward enough. This allows for independent threads to
-coexist without threatening the state of the proces.
+=item new
 
-=item Multiple independent interpreters will be allowed
+  $P1 = new 'Task'
 
-Once again, straightforward enough. With threadsafe protected global
-state, there's no issue here.
-
-=item Only one OS thread in an interpreter at once
-
-While there is no requirement that any interpreter be tied to an
-underlying OS thread, under no circumstances may multiple OS threads
-use a single interpreter simultaneously.
-
-=item A Stop-and-copy communication method will be provided
-
-Parrot will provide a function to make a call into another interpreter
-and wait for that call to complete. This call may pass in data and
-have data returned to it. The interpreter making the call will block
-until the call is complete. The data passed in as parameters will be
-copied into the called interpreter, and any return values will be
-copied back into the calling interpreter. The called interpreter will
-block while the return data is copied back into the calling interpreter.
-
-=item Inter-interpreter events will be provided
-
-Interpreters will be able to post events to other interpreters.
-
-=item Each interpreter will have a unique id
-
-This ID will be independent of the process or OS thread, and will be
-constant across the lifetime of the interpreter. Interpreter IDs
-I<may> be reused as interpreters are destroyed and recreated, and as
-such are only guaranteed valid while an interpreter is in use.
-
-(Note that we may decide to relax this requirement, but doing so
-likely means moving to at least 64-bit integers to mark interpreter IDs)
-
-{{ - Nigel Sandever: The provision of method(s) to obtain the native
-TIDs/HANDLES would make life for those writing implementation specific
-extensions easier.
-- Dan: TIDs, definitely. It'll be a sysinfo or interpinfo function. There's the
-issue of interpreters binding to different threads, but we'll deal with that.
-I'm not sure what the HANDLE is, but explain it and we can work something out.
-:) }}
-
-=item Each interpreter show the same process id
-
-All the interpreters within a process will share a process ID. On
-those systems where each thread has its own unique ID (such as many
-versions of Linux) Parrot will still report a single process ID for
-all interpreters.
-
-This process ID will be the ID of the process that first instantiated
-Parrot.
-
-{{ - Leo: The term "process id" is really misleading. Again I presume that one
-pool is meant here. I'd vote for keeping a process ID what it is and use "pool
-id" or such here. 
-- Dan: Nope. It's the process ID. Nothing misleading there, unless you've done
-threading work under linux, since for a while it gave each thread a separate
-PID. 
-- Leo: $ ps | grep [p]arrot
-17472 pts/0    00:00:00 parrot
-17473 pts/0    00:00:00 parrot
-17474 pts/0    00:00:00 parrot
-
-So the unless applies ;) This is a single parrot interpreter, with main,
-thread-manager, and event-handler thread. 3 PIDs.
-
-- Liz Mattijsen: This _all_ depends on which version of Linux you're using.
-Older versions don''t do it that way, and newer versions don't do it either
-(the specific versions escape me at the moment, but I know RH9 does _not_ have
-pids for threads).
-
-I know, because my Thread::Signal module depends on threads having
-pids.  But fewer and fewer Linux systems "support" it (and Linux was
-the only one who worked this way to begin with).
-
-- Dan: Right. Linux is, as I said, arguably broken here, but I don't want to
-argue that point. That's why I specified that the PID is the process id of
-whatever instantiated parrot -- in this case, no matter which thread you're in,
-when you ask for the pid from parrot you should get 17472. (In this case, at
-least)
-
-Parrot needs to grab the pid at global initialization time and store
-it away for later pid queries. }}
-
-
-=item Interpreter pools will share allocation pools
-
-All the interpreters in an interpreter pool will share header and
-memory allocation pools. This means that when there is more than one
-interpreter in a pool the memory allocation and collection system
-needs to be swapped out, as a copying collector is generally
-untenable in a threaded environment.
-
-{{ - Dan: For a copying collector to work, all the mutators must be
-blocked, and arguably all readers should be blocked as well. (I think
-they need to be, otherwise you may be accessing reclaimed and reused
-memory) That means if we keep a copying collector we need to have
-everything that accesses strings or pmcs to get a lock on the GC
-before every access of every string or PMC. A touch excessive, I
-think. 
-- Gordon Henriksen: True of non-moving collectors, too. Or, let me put it this
-way: non- copying *GC* (the sweep or copy phase) can be threadsafe, but the
-mark phase is never threadsafe. The method in which marking is not threadsafe
-is a bit more pathological (i.e., it's not the common case as it is with the
-copying collector), but a standard tracing DOD cannot be correct when
-competeting with mutators. It WILL collect non- garbage (those are MUTATORS,
-remember), and the result WILL be Heizenbugs and crashes.
-
-Some of what I've written up [AR see next comment] addresses why. It's pretty
-simple to demonstrate a single case to prove the point, but I don't feel like
-re-creating the ASCII art right now. :) I'll send that section when I get out
-of the office.
-
-parrot will have to be able to suspend all threads in the environment.
-Unfortunate, but really quite unavoidable. 
-
-- Dan: I'm not sure that's the case. What we need to do is suspend
-metadata mutation--that is, buffers can't be resized while a gc run is
-in progress. Other than that, if we have guarantees of aligned atomic
-pointer writes (so we don't have word shearing to deal with) we don't
-have to stop the mutation of the contents of the buffers themselves.
-
-The only tricky bit comes in with the examination of the root set of
-other threads--accessing the hardware register contents of another
-running thread may be... interesting. (Which, I suppose, argues for
-some sort of volatile marking of the temp variables) 
-
-- Leo: You'll provide the "interesting" part, that is:
-
-  use Psi::Estimate::CPU_Register_Changes_in_Future_till_mark_is_done; 
-
-- Dan: Nah, no need for that one. I need to go back and recheck the
-stuff that Gordon posted in case I missed something, but if you put a
-lock on the arena allocator this isn't an issue. 
-
-- Gordon Henriksen: Consider this simple object graph:
-
-     Root set = { &A, &B }
-
-     [ A=NULL]
-     [ B=&C  ] ---> [ C=....]
-
-Collection begins in thread 1 and marks A as reachable before being
-preempted by thread 2:
-
-     [xA=NULL]
-     [ B=&C  ] ---> [ C=....]
-
-Thread 2 sets A to &C, and nullifies B before being preempted again by
-thread 1:
-
-     [xA=&C  ] ---> [ C=....]
-     [ B=NULL]
-
-Thread 1 marks B as reachable:
-
-     [xA=&C  ] ---> [ C=....]
-     [xB=NULL]
-
-The mark phase complete, thread 1 sweeps:
-
-     [ A=&???] ---> ???
-     [ B=NULL]
-
-C was not marked reachable (although it was) and was thus erroneously
-collected, leaving a dangling pointer. This problem applies equally to
-copying and mark-sweep collectors. 
-
-- Dan: Ah, OK, I see. The problem comes in where we've got an object in the
-transient root set (basically the processor stack and registers) that
-gets anchored into the base root set (stash, pads, or whatever) after
-the DOD has traced where it's going into and falls out of the
-transient root set before the DOD traces over to it.
-
-Race condition. Dammit.
-
-- Gordon Henriksen: (Worse than that. It could come from any untraced
-location--or possibly even be brand new, depending upon memory
-allocation details.) 
-
-- Dan: Okay, I'd not wrapped my brain around that possibility, which will
-make for some interesting DOD tracing, especially on SMP systems. I
-was *really* hoping a single lock on the arena allocation system that
-the DOD held onto while tracing would be sufficient, but I see that
-it isn't.
-
-That means we're going to have to have either a really forgiving DOD
-system that takes multiple passes before it collects up a PMC or
-buffer (which still isn't safe) or have some way to force a
-low-overhead rendezvous.
-
-The obvious rendezvous point is the arena lock, but that's going to
-see a lot of contention anyway, and we'd as soon not have a single
-one for speed reasons. Feh.
-
-Okay, we're going to mandate that we have read/write locks, each
-interpreter pool has one, and mutating vtable entries must get a read
-lock on the pool read/write lock. Pricey (ick) but isolated to
-mutators. The DOD gets a write lock on it, which'll block all
-read/write access so no mutators can be in process while the pool DOD
-runs.
-
-I think that'll work. The .1j thread spec requires r/w locks, and we
-can fake them on platforms that don't implement them. Hopefully
-Nigel's got the Windows scoop so we can see if Win32 has anything
-like this (which I'd expect it does) 
-
-- Leo: That is well knowm in the literature as "Tri-Color Invariant": Black are
-the already marked (live) PMCs, grey the PMCs on the next_for_GC list,
-white the not yet reached PMCs. The strong tri-color invariants states
-that no black object may point to a white object, the weak invariant
-states, that at least one path from the black to a white object must
-contain a grey one.
-
-This can be handled by either stop the world GCs or by intercepting each
-read or write access that would change the color of an object and update
-the color accordingly. This is e.g. used for incremental GC. As soon as
-we have a thread in the background that runs GC, we have to cope with
-these issues. 
-- Dan: Yeah, point. And since we want to be able to have an incremental DOD
-at some point we need to get support for it in now. 
-
-- Leo: Stopping all interpreters seems to be cheaper. The rwlock will sooner or
-later stop all interpreters anyway (on first PMC access), so we can omit the
-price for the rwlock and just hold the world(s).
-- Dan: The rwlock only stops all the interpreters when the DOD runs.
-Anything that mutates a PMC gets a *read* lock so that they don't
-interfere with each other, and only pause if the DOD is running. The
-DOD getting a *write* lock will block any read lock attempts, so when
-the DOD is running no mutation can take place. Since mutation doesn't
-require any global exclusion it doesn't need a write lock -- the read
-lock is sufficient. 
-- Leo: Sure. But that would still need to aquire a readers rwlock for each PMC
-access. This is more expensive then a mutex. During DOD any PMC access
-will halt the interpreter, so we can do that explicitely too and save
-the cost for the rwlock overhead. Albeit I can imagine, that aggregates
-will need a rwlock anyway. 
-- Dan: Well... only the mutating vtable entries need to get the lock, so
-that reduces the expense somewhat. Still, I agree, it may be
-untenably expensive. 
-
-- Leo: An alternative would be real background incremental GC, *when* running
-multiple threads. I estimate the overhead to be in regions of a rwlock (with no
-contention of course). 
-- Dan: If we have the facilities to do incremental DOD runs then this is
-definitely a possibility except for finalizers. Finalizers make
-things interesting, though if the background thread doing the DOD is
-a member of the interpreter pool then it'd work out OK. 
-- Leo: Finalizers and incremental DOD don't play together. The DOD must run to
-end to be sure, that the objects isn't referenced any more.
-- Dan: Finalizers and incremental DOD work just fine together. At some
-point the incremental DOD will figure out that something's dead, just
-as the stop-the-world DOD will. It just may take a bit longer. 
-- Leo: I wanted to say: "Finalizers & destructors of PMCs that need
-timely destruction ...". In the case of dead objects at scope exit. 
-
-
-- Leo: What about temporary PMCs (or strings)? Evaluating a non-trivial
-expression can have lot of these. Each new temp would need a lock on the
-memory sub-system.
-- Dan: Those won't get marked as shared unless we're unconditionally marking
-things as shared. (Though we may just give 'em a mutex anyway) [New temps would
-need a lock] only on allocation. We could have a per-thread freelist if we
-wanted.  Wouldn't be unreasonable. 
-- Leo: This needs either one check per PMC, if its shared or not, or additional
-costs for locking temps. Both are rather expensive, compared to the raw
-working functionality of a vtable. 
-
-That already smells like separate memory sub-systems. The freelist has
-to be filled first. During DOD runs, it has to be refilled. To achieve
-some locality of the temp PMCs, you can't just give these to arbitrary
-intpreters. Separate free lists seem also to imply separate DOD runs to
-cleanup the temps. 
-
-- Leo: Combined with the cost of: "All shared PMCs must have a
-threadsafe vtable - The first thing that any vtable function of a shared PMC
-must do is to aquire the mutex of the PMCs in its parameter list " 
-i.e. typically 3 mutexes, it could be that the vtable of a shared PMC
-should just grab one interpreter lock (of the owner of that PMC) and
-that not all is shared (especially not the temps). 
-- Dan: Yeah, but since PMCs aren't owned by any one interpreter in the pool
-but are rather pool-wide you run into either the global-lock problem
-(which kills performance on SMP systems) or interpreters potentially
-deadlocking on unrelated data access. 
-- Leo: Could be. But the performance is the overall throughput. When a lot of
-fine grained locks (plus memory subsystem locks for temps) take much
-longer then one global lock, then that scheme can nethertheless be
-slower on SMP machines. It would scale better though for more CPUs.
-
-Deadlocks shouldn't be a problem, if exactly one vtable (like in
-SharedRef) just grabs one and only one lock. 
-
-- Leo: [The Guarantees section] doesn't have anything about user data
-integrity. So when 2 threads access a PerlNum, they could get a
-mixture of the typically 2 involved words.
-- Dan: Potentially, yeah, though it's really unlikely.
-- Leo: But: "The first thing that any vtable function of a shared PMC
-must do is to aquire the mutex of the PMCs in its parameter list..."
-... seems to indicate that even whole ops like add P,P,P are atomic.
-- Dan: Yep. They have to be, because they need to guarantee the integrity of
-the pmc structures and the data hanging off them (which includes
-buffer and string stuff)
-- Leo: But isn't that a contradiction? Or better: When even an opcode
-like above is atomic, that an access to a shared PerlNum should be
-guaranteed being atomic too. 
-- Dan: Sure, but there's a *lot* more to user data integrity than atomic
-access to individual pieces. That's the least of the problem. The
-user data issue is one where you have multiple pieces being updated,
-or one piece being updated multiple times--that's the stuff we're not
-guaranteeing.
-
-So, while we will make sure that storing a single value into a hash
-happens atomically, we won't guarantee that a series of stores into
-the hash, or a combination of loads and stores, or even a combination
-of reads and writes to a scalar, will happen atomically. 
-
-- Leo: And how does user level locking play together with that?
-- Dan: I've not decided -- That was something I thought we might hash out as
-we abused the first half of the design doc. Personally I'm all for
-the user and low-level lock being the same thing, but that doesn't
-have to be the case. There are advantages and disadvantages to either
-way of doing things. 
-
-- Damien Neil: Personally, I think it would be better to use
-corruption-resistant buffer and string structures, and avoid locking
-during basic data access.  While there are substantial differences in VM
-design--PMCs are much more complicated than any JVM data type--the JVM
-does provide a good example that this can be done, and done efficiently.
-
-Failing this, it would be worth investigating what the real-world
-performance difference is between acquiring multiple locks per VM
-operation (current Parrot proposal) vs. having a single lock
-controlling all data access (Python) or jettisoning OS threads
-entirely in favor of VM-level threading (Ruby).  This forfeits the
-ability to take advantage of multiple CPUs--but Leopold's initial
-timing tests of shared PMCs were showing a potential 3-5x slowdown
-from excessive locking.
-
-I've seen software before that was redesigned to take advantage of
-multiple CPUs--and then required no less than four CPUs to match
-the performance of the older, single-CPU version.  The problem was
-largely attributed to excessive locking of mostly-uncontested data
-structures. 
-}}
-
-As the allocation and collection system is a black box to user
-programs and much of the interpreter internals, this isn't a big deal
-outside of needing swappable allocation systems, the potential
-issue of COW'd shared memory leaking, and the need to switch
-allocation schemes mid-execution.
-
-=item Each interpreter has a separate event queue
-
-Some events, such as timers, may be interpreter-specific and, as
-such, each interpreter has its own event queue.
-
-=item Each interpreter pool has a shared event queue
-
-Some events, such as IO callbacks, may not be interpreter-specific,
-and can be serviced by any interpreter in the interpreter pool. For
-these events, there is a pool-wide event queue.
-
-=item PMCs are the coordination point for threads
-
-That is, only PMCs are shared as such between threads. Strings,
-specifically, are I<not> shared between interpreters as such
-
-=item All PMCs shared amongst interpreters in a pool must be marked shared
-
-A PMC which is not marked shared may not be handed to another
-interpreter. Parrot will prevent this from happening either by
-marking the PMC as shared, or throwing an exception when the PMC is
-placed in a spot where it may be shared but is not shareable.
+Creates a new task. (The Scheduler PMC is never instantiated directly;
+it is only used by Parrot internals.)
 
-=item All shared PMCs must have a threadsafe vtable
+=item schedule
 
-The first thing that any vtable function of a shared PMC must do is to
-acquire the mutex of the PMCs in its parameter list, in ascending
-address order. When the mutexes are released they are not required to
-be released in any order.
-
-{{ - Uri: why the ascending address order to grab mutexes? is this to help solve
-deadlocks?
-- Dan: Yes. The recommendation I've always seen for deadlock avoidance is to
-always have all your code grab its mutexes in some fixed order. With source,
-it's generally recommended that you grab mutex variables in lexical order.
-Since we're all binary we need a different order, and ascending addresses are
-reasonably simple to do. 
-- Gordon Henriksen: Note that this locking strategy cannot avoid deadlock if
-the user is allowed to acquire these locks--HLL locks must be altogether
-different beasts from automatic PMC locks. That's okay. Just a design
-consequence worth noting for everyone.
-- Dan: Oh, arguably it can't avoid deadlock at all, what with vtable methods
-having access to the full environment. I can live with deadlocks, only because
-there's no real alternative. 
-- Gordon Henriksen: But PMC implementations have to fall inside of the trusted
-environment, so that's not really a failure. Of course uncooperative code can
-break a cooperative algorithm.  
-
-- Sam Vilain: RE: "have all your code grab its mutexes in some fixed order."
-
-Yes; otherwise, you need to back off and start again, if one lock
-acquisition fails.
-
-Consider these functions; for the purpose of this example, lexical
-lock ordering is implied;
-
-  func1($AAAA, $CCCC, $FFFF, $GGGG, $KKKK);
-  func2($BBBB, $DDDD, $FFFF, $IIII, $KKKK);
-  func3($FFFF, $KKKK);
-
-So long as the locks are ramped up from the lowest to the highest,
-there is no chance of func1 acquiring a lock to be held by func2 (eg,
-$KKKK), if that other function already has one of the shared
-dependancies (eg, $FFFF).
-
-All well and good.  But, what about recursive locks?
-
-ie
-
- sub func1($one is locked, $two is locked, $three is locked) {
-     for my $x ($one, $two, $three) {
-         func2($x.property) if $x.property;
-     }
- }
-
- sub func2($eins is locked) {
-     if ($eins.property) {
-         func2($eins.property, { });
-     }
- }
-
- $aaa.property = { };
- $bbb.property.property = $aaa;
- $ccc = { };
-
- if (thork()) {   # fork a thread
-    # thread A
-    func1($aaa, $bbb, $ccc);
- }
- else {
-    # thread B
-    func2($bbb.property);
- }
-
-OK, the execution order that causes the deadlock is:
-
-  1. Thread B - func2() acquires a lock on the $bbb.property PMC.
-  2. Thread A - func() acquires a lock on $aaa, $bbb, $ccc.
-  3. Thread A - func2() acquires a lock on $aaa.property, which
-     returns quickly
-  4. Thread A - func2() blocks waiting for a lock on $bbb.property,
-     held by func2() in thread B
-  5. Thread B - func2() blocks waiting for a lock on
-     $bbb.property.property, held by func() in thread A.
-
-So, there are still possibilities for deadlocks, as the non-linear
-nature of subroutine calls screws up your nice lock ramping.
-
-In summary, acquiring mutexes in a defined order as a means to avoid
-deadlocks only works when you are acquiring them all up front.  If you
-are acquiring any later, and you detect a deadlock (ie, a loop of
-threads blocking on locks held by each other), you *must* be able to
-tell one of them to "ramp off" to holding no locks at all.  ie,
-ROLLBACK :).
-
-The bugger is, winding back execution state automatically to when you
-last started acquiring locks is probably an intractable problem from
-an interpreter's perspective...
-
-Sounds like a job for an exception to me ;-).
-
-  for (1..50) {
-     eval {
-        func_that_acquires_first_lock($args);
-     };
-     last unless $@ and $@ =~ m/mutex deadlock/i;
-  } 
-}}
-
-=item Automatic PMC sharing will be provided
-
-When a PMC is placed into a container which is shared (including
-lexical pads and global namespaces) then that PMC will automatically
-be marked as shared. It is acceptable for this to trigger an
-exception if for some reason a PMC should not be shared between
-interpreters.
-
-PMCs are, by default, not shared. This avoids sharing overhead for
-PMCs which are only used as temporaries and not shared. (Note that
-this is dangerous, and may end up not being done, due to the sharing
-of continuations)
-
-{{ - Luke: I don't know why this is dangerous.  A continuation is a data structure
-just like an array or a hash.  When you share it, everything "inside" it
-gets shared, too.  For a continuation, this could be an awful lot of
-stuff, but it's the safe way. 
-- Dan: True, but the danger part is if we don't mark everything grabbed by a
-continuation as shared by default. If we do, we might as well mark everything
-as shared, as there'll be less overhead. }}
-
-=item All interpreter constructs in a pool are shareable
-
-This means that a PMC or string may be used by any interpreter in a
-pool. It additionally means that, if full sharing is enabled, that
-any interpreter in a pool may invoke a continuation, assuming the
-continuation is valid. (That is, a continuation taken at parrot's top
-level. Continuations taken within vtable functions, user-defined ops,
-or extension code may not be shareable)
-
-=item The embedding API will allow posting events to a pool
-
-Many events are interpreter-specific, often caused by one particular
-interpreter requesting an async event that later completes.
-
-=item The embedding API will allow posting events to an interpreter
-
-For events that don't have to go to any particular interpreter, they
-can go into the pool's event loop.
-
-=item The embedding API will allow calling a parrot sub with a pool
-
-In those cases where there is an interpreter pool, embedders may call
-a parrot sub using the pool as a whole, rather than an individual
-interpreter, to run the sub. In that case Parrot may either choose a
-dormant interpreter (if there is one) or create a new interpreter in
-the pool to run the subroutine.
-
-When the sub is done, Parrot may either cache the created
-interpreter or destroy it as it needs to, though in no case will
-Parrot ever leave a pool with no interpreters at all.
+  $P0 = new 'Task'
+  # set attributes
+  schedule $P0
 
-=back 
+Register a task with the concurrency scheduler. Details about the task are
+stored within the task PMC.
 
-=head1 QUESTIONS
+=item join
 
-=over 4
+  $P0 = new 'Task'
+  # ... schedule the task, etc.
+  join $P0
 
-=item *
+Wait for a particular task to complete.
 
-Do we need a concurrent version of C<runinterp>? Needs to accept an interp and
-sub argument, as well as an event sub to throw on returning. Also returns a
-status object.
+=item kill
 
-=back
+  $P0 = new 'Task'
+  # ... schedule the task, etc.
+  kill $P0
 
-=head1 IMPLEMENTATION
+Kill a task without waiting for it to complete.
 
-[Excerpt from Perl 6 and Parrot Essentials to seed discussion.]
 
-Threads are a means of splitting a process into multiple pieces that
-execute simultaneously.  It's a relatively easy way to get some
-parallelism without too much work. Threads don't solve all the
-parallelism problems your program may have. Sometimes multiple
-processes on a single system, multiple processes on a cluster, or
-processes on multiple separate systems are better. But threads do
-present a good solution for many common cases.
-
-All the resources in a threaded process are shared between threads.
-This is simultaneously the great strength and great weakness of
-threads. Easy sharing is fast sharing, making it far faster to
-exchange data between threads or access shared global data than to
-share data between processes on a single system or on multiple
-systems. Easy sharing is dangerous, though, since without some sort of
-coordination between threads it's easy to corrupt that shared data.
-And, because all the threads are contained within a single process, if
-any one of them fails for some reason the entire process, with all its
-threads, dies.
-
-With a low-level language such as C, these issues are manageable. The
-core data types, integers, floats, and pointers are all small enough
-to be handled atomically. Composite data can be protected with
-mutexes, special structures that a thread can get exclusive access to.
-The composite data elements that need protecting can each have a mutex
-associated with them, and when a thread needs to touch the data it
-just acquires the mutex first. By default there's very little data
-that must be shared between threads, so it's relatively easy, barring
-program errors, to write thread-safe code if a little thought is given
-to the program structure.
-
-Things aren't this easy for Parrot, unfortunately. A PMC, Parrot's
-native data type, is a complex structure, so we can't count on the
-hardware to provide us atomic access. That means Parrot has to provide
-atomicity itself, which is expensive. Getting and releasing a mutex
-isn't really that expensive in itself. It has been heavily optimized by
-platform vendors because they want threaded code to run quickly. It's
-not free, though, and when you consider that running flat-out Parrot
-does one PMC operation per 100 CPU cycles, even adding an additional 10
-cycles per operation can slow Parrot down by 10%.
-
-For any threading scheme, it's important that your program isn't
-hindered by the platform and libraries it uses. This is a common
-problem with writing threaded code in C, for example. Many libraries
-you might use aren't thread-safe, and if you aren't careful with them
-your program will crash. While we can't make low-level libraries any
-safer, we can make sure that Parrot itself won't be a danger. There is
-very little data shared between Parrot interpreters and threads, and
-access to all the shared data is done with coordinating mutexes. This
-is invisible to your program, and just makes sure that Parrot itself
-is thread-safe.
-
-When you think about it, there are really three different threading
-models. In the first one, multiple threads have no interaction among
-themselves. This essentially does with threads the same thing that's
-done with processes. This works very well in Parrot, with the
-isolation between interpreters helping to reduce the overhead of this
-scheme. There's no possibility of data sharing at the user level, so
-there's no need to lock anything.
-
-In the second threading model, multiple threads run and pass messages
-back and forth between each other. Parrot supports this as well, via
-the event mechanism. The event queues are thread-safe, so one thread
-can safely inject an event into another thread's event queue. This is
-similar to a multiple-process model of programming, except that
-communication between threads is much faster, and it's easier to pass
-around structured data.
-
-In the third threading model, multiple threads run and share data
-between themselves. While Parrot can't guarantee that data at the user
-level remains consistent, it can make sure that access to shared data
-is at least safe. We do this with two mechanisms.
-
-First, Parrot presents an advisory lock system to user code. Any piece
-of user code running in a thread can lock a variable. Any attempt to
-lock a variable that another thread has locked will block until the
-lock is released. Locking a variable only blocks other lock attempts.
-It does I<not> block plain access. This may seem odd, but it's the
-same scheme used by threading systems that obey the POSIX thread
-standard, and has been well tested in practice.
-
-Secondly, Parrot forces all shared PMCs to be marked as such, and all
-access to shared PMCs must first acquire that PMC's private lock. This
-is done by installing an alternate vtable for shared PMCs, one that
-acquires locks on all its parameters. These locks are held only for
-the duration of the vtable function, but ensure that the PMCs affected
-by the operation aren't altered by another thread while the vtable
-function is in progress.
+=back
 
 =head1 ATTACHMENTS
 
