OK. It's time for me to step back from worrying about GC design and get
back to actual work. I've internalized a bunch of stuff and I've managed to
catch up on the last 15 years or so of the literature (almost), and I've
gotten a lot from all of you in the process. I'm going to send Tony and
Eliot a note arising from this discussion, and try to encourage a more
focused discussion of read barriers, forwarding, and concurrent design
tradeoffs in the next edition of *The Garbage Collection Handbook*. For
now, let me try to run down where I am.

I'm very impressed by the current state of reference-counting (RC)
collectors, and I've talked elsewhere about how this approach opens up a
bunch of design and expressiveness options. The key to variable-weight
and/or performance-isolated collection is to introduce [possibly dynamic]
heap membranes and keep track of the object references that cross membrane
boundaries (a sketch of what that tracking might look like appears below).
Other techniques can do this, but hybrid-RC designs seem to do it without
introducing multiplication of mechanism, especially when some variation of
the immix heap organization is adopted.
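
To make the membrane idea concrete, here is roughly the tracking structure
I have in mind: a per-region log of inbound boundary crossings, maintained
by the write barrier. This is only a sketch; region_of() and all of the
field names are invented:

```c
#include <stdlib.h>

typedef struct crossing {
    void **slot;               /* the slot outside the region holding the ref */
    struct crossing *next;
} crossing;

typedef struct region {
    int id;
    crossing *inbound;         /* references from other regions into this one */
} region;

extern region *region_of(void *p);   /* assumed address -> region lookup */

/* Write-barrier fragment: log stores whose source slot and target object
 * straddle a membrane, then perform the store. */
void membrane_store(void **slot, void *ref) {
    if (ref != NULL) {
        region *src = region_of((void *)slot);
        region *dst = region_of(ref);
        if (src != dst) {
            crossing *c = malloc(sizeof *c);
            c->slot = slot;
            c->next = dst->inbound;
            dst->inbound = c;
        }
    }
    *slot = ref;
}
```

This is essentially a per-region remembered set; the attraction of the
hybrid-RC designs is that the write barrier needed to maintain it is
already there.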

We've talked a lot about GC performance, and I've come to feel that we are
using the wrong metrics. Safe language throughput seems to be a mostly
solved problem at this point, by which I mean both that (a) the difference
relative to explicit storage management is now very small, and (b) any
inherent cost of safety is now low enough that I'm willing to pay it, and I
think a lot of other people are too. The question of pause times is a
different matter, and I've become convinced that by focusing on pause times
we have been framing the question the wrong way. I think we should be
focusing on three variables:

   - A time interval (T). Moving closer to hard real-time applications may
   require shorter intervals.
   - Mutator *progress* during each time interval (P). Not just "what
   percentage of time is spent in the mutator" but "how much real-world
   computational progress is it making?" Spending 10% more time in the
   mutator while adding 12% overhead to the mutator isn't winning; the
   sketch after this list works through that arithmetic.
   - *Variance* of that progress over that time interval (V).
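
To make the P point concrete, here is the arithmetic behind the 10%/12%
remark, as a tiny self-contained program. The baseline mutator fraction of
0.80 is a number I picked for illustration, not a measurement:

```c
#include <stdio.h>

int main(void) {
    double base_fraction = 0.80;                 /* assumed baseline: 80% of time in mutator */
    double new_fraction  = base_fraction * 1.10; /* "10% more time in the mutator" */
    double overhead      = 0.12;                 /* "12% overhead to the mutator"  */

    /* Real-world progress is mutator time discounted by mutator overhead. */
    double base_progress = base_fraction;
    double new_progress  = new_fraction / (1.0 + overhead);

    printf("baseline P = %.3f, new P = %.3f\n", base_progress, new_progress);
    /* Prints: baseline P = 0.800, new P = 0.786 -- a net loss. */
    return 0;
}
```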

We would like flexibility in choosing T, possibly at some cost in the rate
of progress P, the variance of progress V, or both. Unfortunately, our
collection needs are also bounded by the allocation rate and the mutation
rate (e.g., through write logging), so this isn't a complete
characterization. If we can arrive at a solution that makes these things
somewhat tunable, I don't think we should be worrying about the last 2% of
throughput performance.

On parallel (as opposed to concurrent) collectors we are now looking at
<50ms pause times on small-scale multiprocessors, with typically 10% or
less of total time spent in storage management.[*] While we incur both
code bloat and execution overhead in barrier code, that is made up
(relative to conventional allocators) by reduced L1 cache miss rates. The
code bloat is *also* made up (usually) by decreases in the overhead
information that must be recorded on each allocation. I've been talking,
in the background, to people I know who build production video games and
codecs. Faced with the prospect of a 50ms pause, they mostly didn't care.
Several of them noted that current context switch and page fault times
often exceed this already, though the "all threads" aspect of the pause
may have implications they weren't considering. *It is not obvious that
on-the-fly collection is better for real-time applications, nor is it
obvious that it is required for them.*

[*] The exception is applications that bulk allocate and bulk deallocate
durable structures, Xalan being a notable example. I believe we all agree
that there are cases where the programmer knows what their needs are, and
the allocator could cooperate with those needs if a means existed to
express them. This won't solve all cases, but it will solve a lot of them.


Almost all of the mutator performance in these discussions hinges on read
barriers: do we need them, and if so, do they need to be fine-grained
(i.e., inlined with the load instructions) or can they be done with page
protections? Here are the answers I have come to so far (a minimal sketch
of a fine-grain barrier follows the list):

   1. Read barriers of some form are required on any sub-heap where
   collectors and mutators may operate concurrently and evacuation/relocation
   may be performed. This means that it is *extremely* desirable to use
   thread-local nursery allocation.
   2. Fine-grain read barriers are *sometimes* required if the mutator
   participates cooperatively in the maintenance of the tri-color (or K-color)
   invariant. It depends on the type of mutator.
   3. Fine-grain read barriers are required if heap regions that are
   subject to recoloring and/or evacuation are "reference hot", because the
   page fault approach isn't cost effective.
   4. Fine-grain read barriers are required in practice if nurseries are
   not thread-local.
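
Since most of what follows turns on these conditions, here is the sort of
thing I mean by a fine-grain read barrier, in its simplest (Brooks-style
forwarding pointer) form. The names are mine, not anyone's actual
implementation:

```c
typedef struct object {
    struct object *forward;   /* points to self until the object is moved */
    /* ... payload ... */
} object;

/* The entire fine-grain read barrier: one dependent load on every
 * reference read. The collector evacuates by copying the object and
 * redirecting 'forward' at the old copy. */
static inline object *read_barrier(object *ref) {
    return ref ? ref->forward : NULL;
}
```

Every reference load the compiler emits becomes read_barrier(p->field),
which is exactly the code bloat and execution overhead at issue above.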

Taken in combination, these conditions mean that read barriers are needed
for any on-the-fly collection scheme, and also for any fully concurrent (as
opposed to *mostly* concurrent) scheme.

I've noted in discussion that there are ways to emit code in which read
barriers can be turned on and off dynamically through low-level dynamic
translation. This prospect gives us a late binding option on the choice of
collector integration strategy, which strikes me as elegant. It also means
that we can separate many of the GC integration issues from the compiler
entirely. We can instead make our choices late on the basis of how hot the
forwarded objects are, what proportion of mutator execution time is
overlapped with collector execution, the dynamic frequency and dynamic
locality with which the conditions that *trigger* read barriers arise, and
our target objectives for P, V, and T.
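
A crude way to see the late-binding point: route reference loads through a
hook that the collector can swap at runtime. Real low-level dynamic
translation would patch the call sites themselves rather than chase a
function pointer, so treat this purely as a sketch of the control
structure (it reuses the object type from the barrier sketch above):

```c
static object *barrier_off(object *ref) { return ref; }  /* identity: barriers "off" */
static object *barrier_on(object *ref)  { return ref ? ref->forward : NULL; }

static object *(*volatile read_hook)(object *) = barrier_off;

/* The collector flips the hook when it starts or stops relocating. */
void enable_read_barriers(void)  { read_hook = barrier_on; }
void disable_read_barriers(void) { read_hook = barrier_off; }
```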

I have been focusing a lot on relocating collectors, mainly because
evacuation of some form seems to be a necessary activity for heap
defragmentation. It isn't clear that relocation is really required,
though. In the immix design, for example, we're already prepared to do
chunked bump allocation. It is conceivable that a block could be
considered "free" if it is "free enough" without being totally cleared
(see the sketch below). If that is possible, then the need for fine-grain
read barriers goes away. It is also possible that we could size-segregate
the blocks in the tenured heap to control fragmentation, or that
interoperation with C creates enough dynamic restrictions on relocation
(through pinning) that relocation isn't that helpful. The pressure to
relocate is a function of total heap size pressure, which is inversely
related to available system DRAM. An unfortunate implication is that small
platforms are more likely to need read barriers (with their associated
code bloat) than large platforms.
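
Here is what the "free enough" test might look like for an immix-style
block, with line marks standing in for the liveness map. The constants are
invented for illustration; picking the threshold well is the whole game:

```c
#define LINES_PER_BLOCK   128  /* invented constants, illustration only       */
#define RECYCLE_THRESHOLD  96  /* "free enough": at least 96 of 128 lines free */

typedef struct block {
    unsigned char line_mark[LINES_PER_BLOCK];  /* nonzero = line holds live data */
} block;

/* A block that passes this test is handed back to the chunked bump
 * allocator, which fills the holes; nothing gets relocated. */
static int free_enough(const block *b) {
    int free_lines = 0;
    for (int i = 0; i < LINES_PER_BLOCK; i++)
        if (b->line_mark[i] == 0)
            free_lines++;
    return free_lines >= RECYCLE_THRESHOLD;
}
```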

The practical benefit of on-the-fly collection (as opposed to concurrent
collection) seems uncertain. When all is said and done, there are only so
many hardware threads of execution available, and the activity of those
hardware threads is going to go back and forth between mutating and
collecting. The "flip" can be implemented voluntarily in the mutator or
implicitly in the operating system. The cost of doing it implicitly is (a)
the need for a fine-grain read barrier, (b) loss of mutator control over
when collection happens, and (c) lost ability to drive collection in
finer-grain increments. The primary *advantage* of on-the-fly is that it
allows collection to proceed during long I/O waits.
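
For concreteness, the voluntary flavor of the flip is just a cheap poll
that the compiler plants at safepoints; the implicit flavor replaces the
poll with a page-protection fault. The two names below are assumed runtime
hooks, not anyone's actual API:

```c
extern volatile int gc_phase_requested;
extern void gc_cooperate(void);   /* scan own roots, acknowledge the flip */

/* Planted at loop back-edges, calls, and allocation sites. */
static inline void safepoint(void) {
    if (gc_phase_requested)
        gc_cooperate();
}
```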

Once a fine-grain read barrier is required, it is worth considering how
many of the checks previously performed by write barriers can instead be
moved to read barriers. It is also worth considering the notion of a
"read-write barrier": combining the read and write barriers when we are
reading a reference to an object that we know we intend to write (sketched
below). I haven't spent much time on either question; the high-order
concern was to determine whether a fine-grain read barrier was required at
all.
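
For what it's worth, here is one possible shape for the combined barrier,
again reusing the object type from the Brooks sketch. log_write() stands
in for whatever the write-barrier half records (a card mark, a deferred-RC
log entry); it is an assumed hook, not a real API:

```c
extern void log_write(object *target);

/* For loads where we already know a store through the result will follow:
 * do the forwarding check and the write logging in one sequence. */
static inline object *read_for_write(object *ref) {
    object *fwd = ref ? ref->forward : NULL;   /* the read-barrier half  */
    if (fwd)
        log_write(fwd);                        /* the write-barrier half */
    return fwd;
}
```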

Once a fine-grain read barrier is required, you have in some respects "paid
the piper", and you might as well get all of the advantage out of it that
you can.

Last, but not least, I note that one of the challenges in deferred
reference counting is that it makes on-the-fly collection a bit tricky. The
essence of it is that something analogous to a tri-color invariant needs to
be maintained on the deferred decrement log so that mutators can go through
a "phase agreement" with the collector. It could be that this is done using
the tri-color invariant itself.
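
One guess at what that phase agreement could look like: stamp each
per-thread deferred-decrement buffer with the collector epoch it was
filled under, and let the collector consume only buffers from retired
epochs after a handshake. Every name here is invented, and the object type
comes from the sketches above:

```c
typedef struct dec_buffer {
    unsigned epoch;              /* collector phase when entries were logged */
    int      count;
    object  *entries[1024];
    struct dec_buffer *next;     /* chain back to retired buffers */
} dec_buffer;

extern volatile unsigned gc_epoch;                /* advanced by the collector */
extern dec_buffer *fresh_buffer(unsigned epoch);  /* assumed buffer allocator  */

/* Mutator side: if the collector has moved to a new phase (or the buffer
 * is full), retire the current buffer and start a fresh one. */
void log_deferred_dec(dec_buffer **cur, object *obj) {
    dec_buffer *b = *cur;
    if (b->epoch != gc_epoch || b->count == 1024) {
        dec_buffer *nb = fresh_buffer(gc_epoch);
        nb->count = 0;
        nb->next  = b;
        *cur = nb;
        b = nb;
    }
    b->entries[b->count++] = obj;
}
```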


shap