Hi,

We had a brief discussion on IRC yesterday that ended in me storming off in a very unprofessional manner. I'd like to publicly apologize for that behavior; it was not cool and had little to do with the conversation at hand. My stress level was very high coming into work yesterday, and I was letting personal life spill into work life. My fault, sorry.

I'd also like to restart (or at least restate) parts of that discussion here so we can actually get this worked out to everyone's satisfaction (including Rafael, who clearly has strong feelings on the matter). This will be a long, rambly email full of back-story to set the stage; if you have specific points to follow up on, just snip those parts out for your replies.


Preface
~~~~~~~

We know (or at least, anyone who's poked at it knows) that the tasking and threading model in rustboot was pretty unsatisfactory. It passed some interesting tests ("millions of tasks, real cheap!") and had a number of needs it was trying to meet -- it was not designed in *complete* ignorance -- but it also imposed strange burdens that we'd like to dispense with this time around. Hopefully without losing the good stuff.

So, some background "goals" in order to make this story make sense. The three primary design pressures are:

(1) Support *isolation*, in some useful sense, between tasks. That is, a task should be able to reason locally about its data and code without worrying about other tasks mucking with that data and code behind its back, with the exception of message-IO points and unsafe blocks, which obviously involve the potential for non-isolated action.

(2) Support *lots* of tasks. Millions. Such that a programmer has no fear about making "too many" tasks in a system, if it decomposes nicely. If for no other reason than to support isolation-based local reasoning (though concurrent, or interleaved, or truly parallel execution is also nice to exploit whenever it's surfaced this way).

(3) Run at relatively high, but more importantly *predictable* performance. As few magical parts as possible in the concurrency model. Take the M:N performance penalty if necessary to achieve the other two goals, so long as there are no random performance discontinuities in the model.

The concurrency model is intimately connected to the memory model, unwinding, GC, and several other things; so when I say we're going to be revisiting design decisions in rustboot's concurrency model, I implicitly include parts of the memory model and related machinery as well.


The Past (rustboot)
~~~~~~~~~~~~~~~~~~~

The "lots of tasks" pressure breaks down into two sub-issues: making tasks small (in the sense of memory) and making them independently scheduled. We approached the "small" issue via growable stacks (doubling vectors with pointer-rewriting) and a very large dose of ugly magic for doing calls "between stacks" (from rust to C). This had lots of unfortunate fallout: debuggers and tools got upset, calling back into rust code from C was mostly impossible, and to support it safely we'd need to be flushing pointers to the stack and re-reading them *constantly*, much more than just "make sure values are pinned somewhere" necessary for GC. We approached the "scheduling" issue by even *more* magic return-address patching during suspended C calls, and a custom cooperative scheduler.

The "isolation" pressure was approached by stratifying the heap memory model into private and shared (between-tasks) memory, with the shared stuff always immutable and acyclic. Sharing was not always possible -- not between threads and not between processes -- but between tasks-in-a-thread it could work, and we figured that scenario was valuable to users. Cheap sending of shared bits between tasks in a thread. Then we'd do a deep copy when we hit domain boundaries.

But to support that scenario, tasks had to be pinned to threads. That is, the concurrency scheme in rustboot involved tasks running within domains (threads or processes; though the latter never materialized), where the user explicitly constructed domains and injected threads into the domain. Once spawned in a domain, a task could not leave it (be migrated to another thread), because it might have pointers into the "shared memory" of the domain. This pinning to domains has the unfortunate performance characteristic of *requiring* a user to pay attention to task:thread assignments; this could be a benefit in some cases -- explicit control can be good -- but in many cases it seemed bad, or at least over-complex. It's not just a matter of "having an M:N scheduler in userspace" (which will, says the literature, always underperform a 1:1 scheduler with the kernel involved) but also of pinning each individual task in M to a thread in N, such that one task blocking (or even just monopolizing a core) could block or slow down a whole group of tasks on the same thread. This is a usability hazard.


The Future (rustc)
~~~~~~~~~~~~~~~~~~

Rustboot is dead (yay!) and we're in the process of working through the leftover cruft in the runtime and removing parts which were bad ideas and/or only around to support rustboot's various limitations and design choices. Rustc doesn't really *have* a tasking system yet -- there are pieces of the communication layer, but eholk is just getting spawn working this week -- so we're mostly doing "rewrite this subsystem" work now. There's been a fair amount of disjointed conversation over this matter, so I'm hoping to consolidate what's on the agenda here.

The "lots of tasks" issue still has two sub-parts: size and scheduling.

We're going to approach "size" via "the other, more standard technique", which is to use a linked list of stack segments and never move them. We will most likely reuse the *exact* technique Go is using here, in the sense of trying to be ABI compatible (at least insofar as this is possible or makes sense) and possibly even use the same runtime support library. This approach is easier for LLVM to cope with (there's a GSoC student implementing it currently), and more tools understand it. It also makes stack segments recyclable between tasks, which should reduce overall memory pressure (equivalent to "shrinking" in our other model). We're also going to use the same "unified" approach to growth and cross-language calling as Go uses -- just grow into a "sufficiently big" segment that may be recycled between tasks in between C calls -- and that may well permit C to call back into rust (assuming it can provide a task* and can be made to play nice with unwinding and GC; see below).

We're also going to approach "scheduling" via "the other, more standard technique", which is to use the POSIX (and before that, System V) <ucontext.h> schedulable user contexts and (sadly) again our own scheduler. Where ucontext isn't OS-provided, we'll provide our own implementation; it's not actually much more than a "save registers to structure A and load them from structure B" routine anyway, just with a standard API. And on some OSs -- specifically those where we discover threads are sufficiently cheap, if running on small stacks -- we're going to lean much more heavily on the OS thread scheduler. See below.
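
For anyone who hasn't played with <ucontext.h>, this toy (not our runtime, just an illustration of the primitive) shows the whole trick: one context for the scheduler, one per task, and swapcontext() at yield points:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t sched_ctx, task_ctx;
    static char task_stack[64 * 1024];

    static void task_body(void) {
        printf("task: running\n");
        swapcontext(&task_ctx, &sched_ctx);   /* yield back to scheduler */
        printf("task: resumed\n");
    }

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof(task_stack);
        task_ctx.uc_link = &sched_ctx;        /* where to go when task_body returns */
        makecontext(&task_ctx, task_body, 0);

        swapcontext(&sched_ctx, &task_ctx);   /* run the task until it yields */
        printf("sched: task yielded\n");
        swapcontext(&sched_ctx, &task_ctx);   /* resume it to completion */
        printf("sched: task done\n");
        return 0;
    }

A real scheduler is just this plus a run queue; the "dial" I mention below is essentially whether that run queue ever tells swapcontext to pick a different task.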

We're going to approach the "isolation" pressure differently. Rather than permit tasks to share pointers at all, we'll be shifting to a stratification of memory based on unique pointers. This will mean that the only possible kinds of send are "move" and "deep copy". Move will happen everywhere in-OS-process, deep-copy between processes. Move semantics -- making a copy while indivisibly de-initializing the source of the copy -- are going to be the focus of a fair bit of work over the next while, and we're betting on them for the messaging/isolation system.
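
At the memory level the distinction is simple; here's a sketch (invented API, not the runtime's) of what the two kinds of send amount to:

    #include <stdlib.h>
    #include <string.h>

    typedef struct msg { size_t len; char *buf; } msg;

    /* In-process send: a move. Ownership of the uniquely-owned buffer
     * transfers to the receiver and the sender's handle is de-initialized,
     * so there is never more than one owner and nothing to refcount. */
    void send_move(msg *dst, msg *src) {
        *dst = *src;
        src->buf = NULL;
        src->len = 0;
    }

    /* Cross-process send: nothing can be shared, so the bytes get a deep
     * copy (shown here as a plain in-memory copy for illustration). */
    void send_deep_copy(msg *dst, const msg *src) {
        dst->len = src->len;
        dst->buf = malloc(src->len);
        memcpy(dst->buf, src->buf, src->len);
    }

Note that nothing in send_move cares which thread either task happens to be running on, which is what opens the door to the migration discussed next.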

Along with minimizing refcounting (and avoiding a host of other thorny semantic issues associated with accidental copying, such as environment capture and double-execution of destructors), this will permit tasks to migrate between threads. Or, put more simply, it will permit us to treat threads as an undifferentiated pool of N workers, and tasks as a pool of M work units; when a thread blocks in C code it will have no effect on the other tasks (other than temporarily using up a "large segment" of stack), and M>N runnable tasks should always "saturate" the threads (thus cores) with work. Moreover, when we're on an OS that has "really cheap threads" we can spin up N == M threads and fall into the case of 1:1 scheduling: back off and let the OS kernel do all the scheduling, and have our scheduler always say "oh, just keep running the same task you're running" every time it checks at a yield point. On Linux, for example, I believe that this may well work better than getting in the way with our own scheduler and ucontext logic. I'm not sure about other OSs, but I'd like to retain this "dial" to be able to dial scheduling back into our runtime if we're on an OS with horrendously bad or otherwise expensive kernel threads.
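
Structurally that's just the usual worker-pool loop. A sketch of the N-workers / M-tasks picture, with invented names and pthreads for concreteness:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct task task;
    struct task { void (*run)(task *); task *next; };

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
    static task *runnable;                  /* the M work units, queued */

    static task *next_task(void) {
        pthread_mutex_lock(&q_lock);
        while (!runnable)
            pthread_cond_wait(&q_cond, &q_lock);
        task *t = runnable;
        runnable = t->next;
        pthread_mutex_unlock(&q_lock);
        return t;
    }

    /* Each of the N undifferentiated workers runs this loop. A worker
     * that blocks in a C call simply sits out for a while; the others
     * keep draining the queue, so runnable tasks keep the cores busy.
     * With N == M and one task per worker, this degenerates into 1:1
     * scheduling and the kernel does all the real scheduling work. */
    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            task *t = next_task();
            t->run(t);                      /* run until yield or completion */
        }
        return NULL;
    }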

In the process of this change, we'll eliminate the concept of a domain from the language and runtime, and just model OS processes as OS processes. We'll still need some runtime support for interacting with subprocesses, of course, just avoid trying to mix metaphors.

Moving to a pool-of-threads model should also permit leaning on "the other, more standard technique" for unwinding: the C++ unwinder (or a large part of it). Since a task blocked-in-C doesn't necessarily block any other tasks (they can run on other threads) we don't need to deschedule the unwinder, which was a large part of my concern for how it might be unusable in this role. We can just let unwinding run to completion (scheduler yield-points can always opt to not-yield when unwinding). A secondary concern has to do with double-faulting (failing within a destructor) but we'll cross that bridge when we come to it; I doubt the policy to call terminate() in C++ is so hard-wired into the unwinder that there are no possible ways of overriding it. Opinions on this welcome.

(Incidentally, the more we drift away from "our own per-frame metadata tables" towards relying on stock components, the more I think using a conservative stack scanner for GC root-finding may be perfectly fine, obviating the need for per-frame GC info and explicit GC-safe points. But that's not terribly related to choice of concurrency strategy; just an interesting note for anyone following along the Saga Of Rust Frame Info)
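
For the curious, conservative root-finding is about this much machinery (sketch only; heap_lookup and mark_root are hypothetical stand-ins for whatever the collector actually provides):

    #include <stdint.h>

    extern void *heap_lookup(uintptr_t addr);   /* hypothetical: addr -> allocation, or NULL */
    extern void  mark_root(void *alloc);        /* hypothetical: record a GC root */

    /* Walk every word of a (stopped) task stack between the current
     * stack pointer (lo) and the segment base (hi); anything that looks
     * like a pointer into the heap is treated as a root. No per-frame
     * tables and no compiler-emitted GC maps, at the cost of occasional
     * false retention. */
    void scan_stack_conservatively(uintptr_t *lo, uintptr_t *hi) {
        for (uintptr_t *p = lo; p < hi; p++) {
            void *alloc = heap_lookup(*p);
            if (alloc)
                mark_root(alloc);
        }
    }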

Our argument yesterday hit a breaking point when we were discussing the relationship between C++ unwind semantics and pthread cancellation. I think this was actually a red herring: 'fail' could only really map to 'pthread_cancel' in the case that we were hard-wired to a 1:1 scheduling model, always using a kernel thread for a task, which, as I've said, I would like to retain only as an option (when on an OS with cheap / fast kernel threads) rather than a semantic requirement. And even if we *did* swallow that requirement, I was arguing yesterday (and I believe further googling today supports) the contention that pthread_cancel is just not a good fit for 'fail' (or kill). It will:

  (a) Only cancel at cancellation points, not "any instruction"; so it's
      probing the presence of a flag in the pthread library anyway. Not
      terribly different, cost-wise, from using our own flag.

  (b) Not run the C++ unwinder in a reliable, portable fashion;
      on some platforms this works but on some it does not, and boost
      has opted not to present the interface for precisely this reason.

Overall I don't think pthread_cancel was a productive avenue for the discussion and I'm sorry we wound up in it. I think it's more useful to talk about ways to make cooperative scheduling (== kill) points cheap enough to live with -- including options for unsafe loops that omit them -- even in the degenerate case where they always say "keep running the task you're running" (1:1 mapping to threads). At least until killed.
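
Concretely, the cost I have in mind for a cooperative kill/yield point is a single flag check, something like the following sketch (names invented); cheap enough to live with even when the answer is always "keep running":

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct rust_task {
        atomic_bool killed;      /* set asynchronously by fail/kill    */
        bool        unwinding;   /* true while destructors are running */
    } rust_task;

    extern void task_begin_unwind(rust_task *t);   /* hypothetical */

    /* Emitted at yield points. In the degenerate 1:1 case this is the
     * whole scheduler: check the flag, keep going. In the M:N case the
     * same spot is also where we'd consider switching tasks -- and, per
     * the unwinding discussion above, where we'd decline to switch while
     * a task is unwinding. */
    void yield_point(rust_task *t) {
        if (atomic_load_explicit(&t->killed, memory_order_relaxed) &&
            !t->unwinding)
            task_begin_unwind(t);
    }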

Phew! Ok. Done. Comments? Preferences? Field datapoints to contribute?

-Graydon