Hi,

We had a brief discussion on IRC yesterday that ended in me storming off in a very unprofessional manner. I'd like to publicly apologize for that behavior; it was not cool and had little to do with the conversation at hand. My stress level was very high coming into work yesterday, and I was letting personal life spill into work life. My fault, sorry.

I'd also like to restart (or at least restate) parts of that discussion here so we can actually get this worked out to everyone's satisfaction (including Rafael, who clearly has strong feelings on the matter). This will be a long, rambly email full of back-story to set the stage; if you have specific points to follow up on, just snip those parts out for your replies.


Preface
~~~~~~~

We know (or at least, anyone who's poked at it knows) that the tasking and threading model in rustboot was pretty unsatisfactory. It passed some interesting tests ("millions of tasks, real cheap!") and had a number of needs it was trying to meet -- it was not designed in *complete* ignorance -- but it also imposed strange burdens that we'd like to dispense with this time around. Hopefully without losing the good stuff.

So, some background "goals" in order to make this story make sense. The three primary design pressures are:

(1) Support *isolation*, in some useful sense, between tasks. That is, a task should be able to reason locally about its data and code without worrying about other tasks mucking with that data and code behind its back, with the exception of message-IO points and unsafe blocks, which obviously involve the potential for non-isolated action.

(2) Support *lots* of tasks. Millions. Such that a programmer has no fear about making "too many" tasks in a system, if it decomposes nicely. If for no other reason than to support isolation-based local reasoning (though concurrent, or interleaved, or truly parallel execution is also nice to exploit whenever it's surfaced this way).

(3) Run at relatively high, but more importantly *predictable* performance. As few magical parts as possible in the concurrency model. Take the M:N performance penalty if necessary to achieve the other two goals, so long as there are no random performance discontinuities in the model.

The concurrency model is intimately connected to the memory model, unwinding, GC, and several other things; so when I say we're going to be revisiting design decisions in rustboot's concurrency model, I implicitly include parts of the memory model and related machinery as well.


The Past (rustboot)
~~~~~~~~~~~~~~~~~~~

The "lots of tasks" pressure breaks down into two sub-issues: making tasks small (in the sense of memory) and making them independently scheduled. We approached the "small" issue via growable stacks (doubling vectors with pointer-rewriting) and a very large dose of ugly magic for doing calls "between stacks" (from rust to C). This had lots of unfortunate fallout: debuggers and tools got upset, calling back into rust code from C was mostly impossible, and to support it safely we'd need to be flushing pointers to the stack and re-reading them *constantly*, much more than just "make sure values are pinned somewhere" necessary for GC. We approached the "scheduling" issue by even *more* magic return-address patching during suspended C calls, and a custom cooperative scheduler.

The "isolation" pressure was approached by stratifying the heap memory model into private and shared (between-tasks) memory, with the shared stuff always immutable and acyclic. Sharing was not always possible -- not between threads and not between processes -- but between tasks-in-a-thread it could work, and we figured that scenario was valuable to users. Cheap sending of shared bits between tasks in a thread. Then we'd do a deep copy when we hit domain boundaries.

But to support that scenario, tasks had to be pinned to threads. That is, the concurrency scheme in rustboot involved tasks running within domains (threads or processes; though the latter never materialized), where the user explicitly constructed domains and injected threads into the domain. Once spawned in a domain, a task could not leave it (be migrated to another thread), because it might have pointers into the "shared memory" of the domain. This pinning to domains has the unfortunate performance characteristic of *requiring* a user to pay attention to task:thread assignments; this could be a benefit in some cases -- explicit control can be good -- but in many cases it seemed bad, or at least over-complex. It's not just a matter of "having an M:N scheduler in userspace" (which will, says the literature, always underperform a 1:1 scheduler with the kernel involved) but also of pinning each individual task in M to a thread in N, such that one task blocking (or even just monopolizing a core) could block or slow down a whole group of tasks on the same thread. This is a usability hazard.


The Future (rustc)
~~~~~~~~~~~~~~~~~~

Rustboot is dead (yay!) and we're in the process of working through the leftover cruft in the runtime and removing parts which were bad ideas and/or only around to support rustboot's various limitations and design choices. Rustc doesn't really *have* a tasking system yet -- there are pieces of the communication layer, but eholk is just getting spawn working this week -- so we're mostly doing "rewrite this subsystem" work now. There's been a fair amount of disjointed conversation over this matter, so I'm hoping to consolidate what's on the agenda here.

The "lots of tasks" issue still has two sub-parts: size and scheduling.

We're going to approach "size" via "the other, more standard technique", which is to use a linked list of stack segments and never move them. We will most likely reuse the *exact* technique Go is using here, in the sense of trying to be ABI compatible (at least insofar as this is possible or makes sense) and possibly even use the same runtime support library. This approach is easier for LLVM to cope with (there's a GSoC student implementing it currently), and more tools understand it. It also makes stack segments recyclable between tasks, which should reduce overall memory pressure (equivalent to "shrinking" in our other model). We're also going to use the same "unified" approach to growth and cross-language calling as Go uses -- just grow into a "sufficiently big" segment that may be recycled between tasks in between C calls -- and that may well permit C to call back into rust (assuming it can provide a task* and can be made to play nice with unwinding and GC; see below).

We're also going to approach "scheduling" via "the other, more standard technique", which is to use the POSIX (and before that, System V) <ucontext.h> schedulable user contexts and (sadly) again our own scheduler. Where ucontext isn't OS-provided, we'll provide our own implementation; it's not actually much more than a "save registers to structure A and load them from structure B" routine anyway, just with a standard API. And on some OSs -- specifically those where we discover threads are sufficiently cheap, if running on small stacks -- we're going to lean much more heavily on the OS thread scheduler. See below.
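
For anyone who hasn't played with <ucontext.h>, this toy (not our runtime, just an illustration of the primitive) shows the whole trick: one context for the scheduler, one per task, and swapcontext() at yield points:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t sched_ctx, task_ctx;
    static char task_stack[64 * 1024];

    static void task_body(void) {
        printf("task: running\n");
        swapcontext(&task_ctx, &sched_ctx);   /* yield back to scheduler */
        printf("task: resumed\n");
    }

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof(task_stack);
        task_ctx.uc_link = &sched_ctx;        /* where to go when task_body returns */
        makecontext(&task_ctx, task_body, 0);

        swapcontext(&sched_ctx, &task_ctx);   /* run the task until it yields */
        printf("sched: task yielded\n");
        swapcontext(&sched_ctx, &task_ctx);   /* resume it to completion */
        printf("sched: task done\n");
        return 0;
    }

A real scheduler is just this plus a run queue; the "dial" I mention below is essentially whether that run queue ever tells swapcontext to pick a different task.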

We're going to approach the "isolation" pressure differently. Rather than permit tasks to share pointers at all, we'll be shifting to a stratification of memory based on unique pointers. This will mean that the only possible kinds of send are "move" and "deep copy". Move will happen everywhere in-OS-process, deep-copy between processes. Move semantics -- making a copy while indivisibly de-initializing the source of the copy -- are going to be the focus of a fair bit of work over the next while, and we're betting on them for the messaging/isolation system.
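
At the memory level the distinction is simple; here's a sketch (invented API, not the runtime's) of what the two kinds of send amount to:

    #include <stdlib.h>
    #include <string.h>

    typedef struct msg { size_t len; char *buf; } msg;

    /* In-process send: a move. Ownership of the uniquely-owned buffer
     * transfers to the receiver and the sender's handle is de-initialized,
     * so there is never more than one owner and nothing to refcount. */
    void send_move(msg *dst, msg *src) {
        *dst = *src;
        src->buf = NULL;
        src->len = 0;
    }

    /* Cross-process send: nothing can be shared, so the bytes get a deep
     * copy (shown here as a plain in-memory copy for illustration). */
    void send_deep_copy(msg *dst, const msg *src) {
        dst->len = src->len;
        dst->buf = malloc(src->len);
        memcpy(dst->buf, src->buf, src->len);
    }

Note that nothing in send_move cares which thread either task happens to be running on, which is what opens the door to the migration discussed next.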

Along with minimizing refcounting (and avoiding a host of other thorny semantic issues associated with accidental copying, such as environment capture and double-execution of destructors), this will permit tasks to migrate between threads. Or, put more simply, it will permit us to treat threads as an undifferentiated pool of N workers, and tasks as a pool of M work units; when a thread blocks in C code it will have no effect on the other tasks (other than temporarily using up a "large segment" of stack), and M>N runnable tasks should always "saturate" the threads (thus cores) with work. Moreover, when we're on an OS that has "really cheap threads" we can spin up N == M threads and fall into the case of 1:1 scheduling: back off and let the OS kernel do all the scheduling, and have our scheduler always say "oh, just keep running the same task you're running" every time it checks at a yield point. On Linux, for example, I believe that this may well work better than getting in the way with our own scheduler and ucontext logic. I'm not sure about other OSs, but I'd like to retain this "dial" to be able to dial scheduling back into our runtime if we're on an OS with horrendously bad or otherwise expensive kernel threads.
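
Structurally that's just the usual worker-pool loop. A sketch of the N-workers / M-tasks picture, with invented names and pthreads for concreteness:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct task task;
    struct task { void (*run)(task *); task *next; };

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
    static task *runnable;                  /* the M work units, queued */

    static task *next_task(void) {
        pthread_mutex_lock(&q_lock);
        while (!runnable)
            pthread_cond_wait(&q_cond, &q_lock);
        task *t = runnable;
        runnable = t->next;
        pthread_mutex_unlock(&q_lock);
        return t;
    }

    /* Each of the N undifferentiated workers runs this loop. A worker
     * that blocks in a C call simply sits out for a while; the others
     * keep draining the queue, so runnable tasks keep the cores busy.
     * With N == M and one task per worker, this degenerates into 1:1
     * scheduling and the kernel does all the real scheduling work. */
    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            task *t = next_task();
            t->run(t);                      /* run until yield or completion */
        }
        return NULL;
    }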

In the process of this change, we'll eliminate the concept of a domain from the language and runtime, and just model OS processes as OS processes. We'll still need some runtime support for interacting with subprocesses, of course, just avoid trying to mix metaphors.

Moving to a pool-of-threads model should also permit leaning on "the other, more standard technique" for unwinding: the C++ unwinder (or a large part of it). Since a task blocked-in-C doesn't necessarily block any other tasks (they can run on other threads) we don't need to deschedule the unwinder, which was a large part of my concern for how it might be unusable in this role. We can just let unwinding run to completion (scheduler yield-points can always opt to not-yield when unwinding). A secondary concern has to do with double-faulting (failing within a destructor) but we'll cross that bridge when we come to it; I doubt the policy to call terminate() in C++ is so hard-wired into the unwinder that there are no possible ways of overriding it. Opinions on this welcome.

(Incidentally, the more we drift away from "our own per-frame metadata tables" towards relying on stock components, the more I think using a conservative stack scanner for GC root-finding may be perfectly fine, obviating the need for per-frame GC info and explicit GC-safe points. But that's not terribly related to choice of concurrency strategy; just an interesting note for anyone following along the Saga Of Rust Frame Info)
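
For the curious, conservative root-finding is about this much machinery (sketch only; heap_lookup and mark_root are hypothetical stand-ins for whatever the collector actually provides):

    #include <stdint.h>

    extern void *heap_lookup(uintptr_t addr);   /* hypothetical: addr -> allocation, or NULL */
    extern void  mark_root(void *alloc);        /* hypothetical: record a GC root */

    /* Walk every word of a (stopped) task stack between the current
     * stack pointer (lo) and the segment base (hi); anything that looks
     * like a pointer into the heap is treated as a root. No per-frame
     * tables and no compiler-emitted GC maps, at the cost of occasional
     * false retention. */
    void scan_stack_conservatively(uintptr_t *lo, uintptr_t *hi) {
        for (uintptr_t *p = lo; p < hi; p++) {
            void *alloc = heap_lookup(*p);
            if (alloc)
                mark_root(alloc);
        }
    }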

Our argument yesterday hit a breaking point when we were discussing the relationship between C++ unwind semantics and pthread cancellation. I think this was actually a red herring: 'fail' could only really map to 'pthread_cancel' in the case that we were hard-wired to a 1:1 scheduling model, always using a kernel thread for a task, which, as I've said, I would like to retain only as an option (when on an OS with cheap / fast kernel threads) rather than a semantic requirement. And even if we *did* swallow that requirement, I was arguing yesterday (and I believe further googling today supports) the contention that pthread_cancel is just not a good fit for 'fail' (or kill). It will:

  (a) Only cancel at cancellation points, not "any instruction"; so it's
      probing the presence of a flag in the pthread library anyway. Not
      terribly different, cost-wise, from using our own flag.

  (b) Not run the C++ unwinder in a reliable, portable fashion;
      on some platforms this works but on some it does not, and boost
      has opted not to present the interface for precisely this reason.

Overall I don't think pthread_cancel was a productive avenue for the discussion and I'm sorry we wound up in it. I think it's more useful to talk about ways to make cooperative scheduling (== kill) points cheap enough to live with -- including options for unsafe loops that omit them -- even in the degenerate case where they always say "keep running the task you're running" (1:1 mapping to threads). At least until killed.
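
Concretely, the cost I have in mind for a cooperative kill/yield point is a single flag check, something like the following sketch (names invented); cheap enough to live with even when the answer is always "keep running":

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct rust_task {
        atomic_bool killed;      /* set asynchronously by fail/kill    */
        bool        unwinding;   /* true while destructors are running */
    } rust_task;

    extern void task_begin_unwind(rust_task *t);   /* hypothetical */

    /* Emitted at yield points. In the degenerate 1:1 case this is the
     * whole scheduler: check the flag, keep going. In the M:N case the
     * same spot is also where we'd consider switching tasks -- and, per
     * the unwinding discussion above, where we'd decline to switch while
     * a task is unwinding. */
    void yield_point(rust_task *t) {
        if (atomic_load_explicit(&t->killed, memory_order_relaxed) &&
            !t->unwinding)
            task_begin_unwind(t);
    }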

Phew! Ok. Done. Comments? Preferences? Field datapoints to contribute?

-Graydon