I owe a bunch of people here an apology. There is a *lot* to know in this
space. Sometimes it gets overwhelming, I get stuck, and I start refusing
input. This can lead me to be pretty stupid. The only saving grace is that
I eventually get embarrassed enough to catch up on the literature and
un-stick myself.

We've had an ongoing discussion about garbage collection (GC) here, notably
focusing on real-time requirements. There has been a side discussion on
reference counting (RC), and a second side discussion on regions. I'm
afraid I rejected RC on the basis of its notoriously bad performance,
mainly because I didn't know about the RC-immix work. I think I need to
start this discussion over now that I'm listening. Let me begin by trying
to characterize what I now think I know. I strongly encourage that
everybody read the paper on
Immix<http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf>,
and then the paper on
RC-immix<http://research.microsoft.com/pubs/202163/rcix-oopsla-2013.pdf>,
in that order. I found the RC-immix paper amusing, because when I read
their earlier 
paper<http://users.cecs.anu.edu.au/~steveb/downloads/pdf/rc-ismm-2012.pdf>on
reference counting, my first thought was that they had to be eating
10%
or so in arena management. Sure enough, that was correct. But an additional
reason to read the two immix papers is that they lay out a consistent
terminology for discussion.

There are three issues to consider in any storage allocation/reclamation
approach: total CPU overhead, total memory overhead (measured in required
real pages, not heap size), and pause times. For our purposes, CPU overhead
can reasonably be approximated by analysis of L1 and L2 cache misses. TLB
misses matter too, but the literature I have seen doesn't examine them. The
TLB is generally a *much* smaller cache than the I- or D-cache, and the
cost of a TLB miss is often higher than that of an I- or D-cache miss.

*Overheads*

Let me start with a summary of what has been reloaded into my head at this
point:

1. Any scheme that naively uses a *k*-sized (size-segregated) free list
mechanism is subject to both immediate overhead and fragmentation. The
overhead arises from the need to manage the size-based free lists - not so
much the lists themselves as the policy decisions about when to coalesce
them. The fragmentation arises because frees do not obey allocation order,
so the arena's layout drifts progressively further from allocation order.
Absent dynamic feedback, the best predictor we have for access locality is
still allocation order. Eliminating the k-sized free list mechanism is the
main advantage of compacting and evacuating collectors. (See the first
sketch after this list.)

2. The object release rate is bounded from above by the rate of pointer
overwrites. This is true because *some* pointer overwrites cause objects to
become unreferenced. Unfortunately this is a very *loose* bound: there is
very compelling data that few old objects are ever released, and that
*when* released, they are released in bulk. The cost of an old-space mark
pass is tightly proportional to the size of the old-space heap; the cost of
reference counting is tightly proportional to the rate of pointer
overwrites. (See the second sketch after this list.)

3. *All* GC and RC systems require some form of mark pass that is linear in
the size of the heap. GC systems require it as their baseline mechanism; RC
systems require it for cycle collection (yes, I know about trial deletion
techniques - I'm trying to stick to practical options here). In RC systems,
the expected yield of the mark pass is much lower than in GC systems,
because the majority of releases are accomplished promptly by RC.

The *cost* of the mark (and possibly sweep) pass comes in multiple forms:
CPU cost, cache pollution, power consumption, and (in some cases) delay of
the mutator thread(s).

4. The pause times associated with collections in new space are small and
mostly tolerable. Locality concerns are much reduced in new space, mainly
because the entirety of new space tends to fit in cache. There are several
sensible algorithms for new space, some of which are concurrent. While we
need to choose a (family of) algorithm(s) with care, this does not seem to
be the thing that really stands in the way of real-time subsystems. I
suspect it is useful for the application to be able to say, explicitly,
"this is a long-lived allocation".

5. In systems that have any sort of real-time requirement, concurrent
collection is really the only game in town.

6. The RC-immix work has brought reference counting back into the realm
where it can reasonably be considered. The big win is that it reclaims much
of the locality that has previously been achieved only in
compacting/evacuating collectors.

7. In GC systems, there is a potentially long delay between the time an
object becomes unreachable and the time its storage can be reused. This
delay grows progressively longer as the object's generation ages. In RC
systems, this delay is present only for *cyclic* objects and for the small
proportion of objects whose reference count reaches a so-called "sticky"
count.

8. *Every GC system has a bad case*. Even for purely concurrent GCs, there
is a relationship between CPU overhead, heap size, release rate, and
allocation rate. If the behavior of the application goes "outside the
zone", real-time guarantees are lost. I don't see this as bad, because in
the real world *all* real-time guarantees are a function of available
resources and tolerable cost. The problem I have is that the envelope is
rarely characterized well enough to do any good.

9. An under-noted advantage of RC is that it does not (or at least *should*
not) incur the real-DRAM overheads of GC: a tracing collector needs
substantial heap headroom in resident pages to keep collection frequency
tolerable, whereas RC reclaims most storage promptly. This is a very
important thing.

*Relevance of Regions*

I still have an intuition that regions are relevant - or at least, *first
class* regions are relevant. It is often true that developers *know* when
they are allocating cyclic structures, and that the cyclic nature of the
structure is durable. That is: once attached to the graph, objects aren't
released until the whole graph is released. It still seems possible to me
that we can use static checking to ensure that the developer's belief is
correct. Having done so, it seems to me that we can exploit the result for
GC purposes. For example, if we know of some object that "all pointers to
this object come from within region R, except for borrowed pointers and
root pointers", we have effectively told the GC that (a) these objects all
belong in a common generation, and (b) for cycle detection purposes, a mark
pass confined to R suffices; no full-heap mark is needed.
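
As a strawman, here is roughly what a first-class region might look like.
Everything here is hypothetical - the point is the invariant, which I
assert dynamically where the imagined static checker would discharge it at
compile time:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Hypothetical first-class region: a bump-allocated arena whose objects
    // may point at each other freely, cycles included.  The invariant the
    // imagined type system would enforce statically - asserted dynamically
    // here - is that heap pointers into the region originate only from the
    // region itself; everything else must be a root or a borrowed pointer.
    class Region {
      static constexpr std::size_t kChunk = 1 << 16;
      std::vector<char*> chunks_;
      char* cursor_ = nullptr;
      char* limit_ = nullptr;

    public:
      void* alloc(std::size_t bytes) {
        bytes = (bytes + 15) & ~static_cast<std::size_t>(15);
        if (cursor_ == nullptr || cursor_ + bytes > limit_) {
          chunks_.push_back(new char[kChunk]);
          cursor_ = chunks_.back();
          limit_ = cursor_ + kChunk;
        }
        void* p = cursor_;
        cursor_ += bytes;  // bump allocation: layout follows allocation order
        return p;
      }

      bool owns(const void* p) const {
        const char* q = static_cast<const char*>(p);
        for (char* c : chunks_)
          if (q >= c && q < c + kChunk) return true;
        return false;
      }

      // All stores of region-internal pointers go through here.  Under the
      // invariant, cycle detection never needs to mark outside this region.
      void store(void** slot, void* target) {
        assert(owns(slot) && (target == nullptr || owns(target)));
        *slot = target;
      }

      // Releasing the region frees the whole (possibly cyclic) graph at
      // once, with no per-object work: exactly the "released in bulk"
      // behavior noted above.
      ~Region() { for (char* c : chunks_) delete[] c; }
    };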

*Relevance of Borrowed Pointers*

I'm still confused about how far borrowed pointers go in Rust, but the key
point is that they are "invisible to the collector". To the extent that
this is true, borrowed pointers *should* also be invisible for reference
count maintenance purposes. Since the main source of hassle with reference
counting is pointer overwrite, I think it's very important for us to
understand the type rules for borrowed pointers.
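
I can't speak to Rust's exact type rules, but a small C++ analogy (names
mine) shows what being invisible for reference count maintenance buys: the
borrowed form generates no count traffic at all, while the counted form
pays two atomic updates per call:

    #include <cstdio>
    #include <memory>

    struct Node { int value = 42; };

    // A counted reference: copying the argument bumps the shared count on
    // entry and drops it on exit - two atomic RMW operations per call.
    int read_counted(std::shared_ptr<Node> n) { return n->value; }

    // A borrowed reference: the callee relies on the caller to keep the
    // object alive, so no count is touched.  This is the property we want
    // the type rules for borrowed pointers to guarantee.
    int read_borrowed(const Node& n) { return n.value; }

    int main() {
      auto n = std::make_shared<Node>();
      std::printf("%d %d\n", read_counted(n), read_borrowed(*n));
      return 0;
    }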

I have a sneaking suspicion that the idea of "store once" pointers is very
important as well. The distinction in my mind between "store once" and
"init only" is that a store-once pointer can be transitioned from NULL to
an object reference by a store operation after initialization has occurred
- but only once. The reason this matters is that the reference such a
pointer holds can never be destroyed by overwrite, which means that local
copies of the value are more readily observed to be borrowed.
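
Here is a sketch of such a cell, in C++ terms (names hypothetical). It
admits exactly one NULL-to-value transition, so it can never be the source
of an RC decrement, and anything loaded from it can be treated as borrowed
for as long as the cell itself is live:

    #include <atomic>
    #include <cassert>

    // Hypothetical store-once pointer cell: NULL at initialization, and may
    // transition to a non-null value exactly once thereafter.  Because no
    // store can ever replace a previous non-null value, the cell never
    // generates an RC decrement, and a local copy of the loaded value can
    // be treated as borrowed while the cell is live.
    template <typename T>
    class StoreOnce {
      std::atomic<T*> ptr_{nullptr};

    public:
      // Returns false if someone else already performed the single store.
      bool set(T* value) {
        assert(value != nullptr);
        T* expected = nullptr;
        return ptr_.compare_exchange_strong(expected, value,
                                            std::memory_order_release,
                                            std::memory_order_relaxed);
      }

      // May observe nullptr before the store; never observes two different
      // non-null values over the cell's lifetime.
      T* get() const { return ptr_.load(std::memory_order_acquire); }
    };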

More generally, this leads me to notice that we need to be doing local
(conservative) liveness inference.

*Opportunistic Relocation*

One attraction of (most) RC schemes is that objects don't get relocated.
This makes it *much* easier to interact successfully with libraries written
in C/C++ - so much so that some amount of performance can and should be
sacrificed in the interest of interoperability.

A key reason the immix work succeeds is that it does opportunistic
relocation, which would appear (at first glance, anyway) to defeat
interoperability with C/C++. The immix approach, however, degrades
gracefully in the presence of pinned objects. As long as the pins happen
properly, C and C++ code can reference objects in the GC heap.
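
To make "the pins happen properly" slightly more concrete, here is a sketch
of what the interface contract might look like. The header bit and function
names are hypothetical, not immix's actual layout:

    #include <cstdint>

    // Hypothetical GC object header bit (sketch).
    struct GcHeader {
      std::uint32_t flags = 0;
      static constexpr std::uint32_t kPinned = 1u << 0;
    };

    // While pinned, the collector must treat the object as immovable: an
    // opportunistic defragmentation pass simply skips it (degrading
    // locality a little) instead of evacuating it.  A raw pointer handed
    // to C/C++ stays valid for exactly the pin's duration.
    inline void gc_pin(GcHeader* h)   { h->flags |= GcHeader::kPinned; }
    inline void gc_unpin(GcHeader* h) { h->flags &= ~GcHeader::kPinned; }

    // Collector side (sketch): evacuate only unpinned objects.
    inline bool gc_may_evacuate(const GcHeader* h) {
      return (h->flags & GcHeader::kPinned) == 0;
    }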

An interesting question here becomes: could we hybridize C to sandbox
"unsafe" references and introduce a new class of "memory safe" references?
Things like this have been tried before, obviously.



*Summary*

I'm inclined, at this point, to adopt something like RC-immix as the
baseline for storage management in BitC.  The one piece that I don't (yet)
feel I have a handle on is whether modern RC methods degrade, and if so how
badly, on weakly ordered memory subsystems.
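
On that last question, the issue can at least be stated precisely. The
standard discipline for a shared reference count - this is the common
pattern, nothing RC-immix specific - is a relaxed increment and a release
decrement with an acquire fence before destruction. On weakly ordered
hardware (ARM, POWER), those ordered operations are exactly where
shared-count RC pays:

    #include <atomic>

    struct Shared {
      std::atomic<unsigned> rc{1};
      int payload = 0;
    };

    // Increments need no ordering: holding a reference already proves the
    // object is live, so a relaxed RMW suffices even on ARM/POWER.
    inline void retain(Shared* s) {
      s->rc.fetch_add(1, std::memory_order_relaxed);
    }

    // The decrement is where weak memory bites.  The release ensures our
    // writes to the object happen-before the count reaching zero; the
    // acquire fence ensures the freeing thread sees every other thread's
    // writes before running the destructor.
    inline void release_ref(Shared* s) {
      if (s->rc.fetch_sub(1, std::memory_order_release) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete s;
      }
    }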


shap