I owe a bunch of people here an apology. There is a *lot* to know in this space. Sometimes it gets overwhelming, I get stuck, and I start refusing input. This can lead me to be pretty stupid. The only saving grace is that I eventually get embarrassed enough to catch up on the literature and un-stick myself.
We've had an ongoing discussion about garbage collection (GC) here, notably focusing on real-time requirements. There has been a side discussion on reference counting (RC), and a second side discussion on regions. I'm afraid I rejected RC on the basis of its notoriously bad performance, mainly because I didn't know about the RC-Immix work. I think I need to start this discussion over now that I'm listening. Let me begin by trying to characterize what I now think I know.

I strongly encourage everybody to read the paper on Immix <http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf>, and then the paper on RC-Immix <http://research.microsoft.com/pubs/202163/rcix-oopsla-2013.pdf>, in that order. I found the RC-Immix paper amusing, because when I read their earlier paper on reference counting <http://users.cecs.anu.edu.au/~steveb/downloads/pdf/rc-ismm-2012.pdf>, my first thought was that they had to be eating 10% or so in arena management. Sure enough, that was correct. An additional reason to read the two Immix papers is that they lay out a consistent terminology for discussion.

There are three issues to consider in any storage allocation/reclamation approach: total CPU overhead, total memory overhead (measured in required real pages, not heap size), and pause times. For our purposes, CPU overhead can reasonably be approximated by an analysis of L1 and L2 cache misses. TLB misses are also important, though the literature I have seen doesn't seem to examine that number. The TLB is generally a *much* smaller cache than the I- or D-cache, and the cost of a TLB miss is often higher than that of an I- or D-cache miss.

*Overheads*

Let me start with a summary of what has been reloaded into my head at this point:

1. Any scheme that naively uses a *k*-sized free list mechanism is subject to both immediate overhead and fragmentation. The overhead arises from the need to manage the size-based free lists.
Not so much the lists themselves as the policy decisions about coalescing those lists. The fragmentation arises because operations that free objects do not obey allocation order, which causes the arena's locality to become progressively less aligned with allocation order. Absent dynamic feedback, the best predictor we have for access locality is still allocation order. The elimination of a *k*-sized free list mechanism is the main advantage of compacting and evacuating collectors.

2. The object release rate is bounded above by the number of pointer overwrites. This is true because *some* pointer overwrites cause objects to become unreferenced. Unfortunately this is a very *loose* bound. There is very compelling data that few old objects are ever released, and that *when* released, they are released in bulk. The cost of an old-space mark pass is tightly proportional to the size of the old-space heap. The cost of reference counting is tightly proportional to the rate of pointer overwrites.

3. *All* GC and RC systems require some form of mark pass that is linear in the size of the heap. GC systems require it as their baseline mechanism; RC systems require it for cycle collection (yes, I know about trial unreference techniques - I'm trying to stick to practical options here). In RC systems, the expected yield of the mark pass is much lower than in GC systems, because the majority of releases are accomplished promptly by RC. The *cost* of the mark (and possibly sweep) pass comes in multiple forms: CPU cost, cache pollution, power consumption, and (in some cases) delay of the mutator thread(s).

4. The pause times associated with collections in new space are small, and mostly tolerable. Locality concerns are much reduced in new space, mainly because the entirety of new space tends to fit in cache. There are several sensible algorithms for new space, some of which are concurrent.
While we need to choose a (family of) algorithm(s) with care, this does not seem to be the thing that is really in the way of real-time subsystems. I suspect it is useful for the application to be able to say, explicitly, "this is a long-lived allocation".

5. In systems that have any sort of real-time requirement, concurrent collection is really the only game in town.

6. The RC-Immix work has brought reference counting back into the realm where it can reasonably be considered. The big win is that it reclaims much of the locality that has previously been achieved only by compacting/evacuating collectors.

7. In GC systems, there is a potentially long delay between the time an object becomes unreachable and the time its storage can be reused. This delay grows progressively larger as the generation gets older. In RC systems, this delay is present for *cyclic* objects, and for the small proportion of objects whose reference count reaches a so-called "sticky" count.

8. *Every GC system has a bad case.* Even for purely concurrent GCs, there is a relationship between CPU overhead, heap size, release rate, and allocation rate. If the behavior of the application goes "outside the zone", real-time guarantees are lost. I don't see this as bad, because in the real world *all* real-time guarantees are a function of available resources and tolerable cost. The problem I have with this is that the envelope is rarely characterized well enough to do any good.

9. An under-noted advantage of RC is that it does not (or at least *should* not) incur the real-DRAM overheads of GC - a tracing collector typically needs a heap several times larger than the live data to keep collection costs down. This is a very important thing.

*Relevance of Regions*

I still have an intuition that regions are relevant - or at least, *first-class* regions are relevant. It is often true that developers *know* when they are allocating cyclic structures, and that the cyclic nature of the structure is durable. That is: once attached to the graph, objects aren't released until the whole graph is released.
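To make points 2 and 7 above concrete, here is a toy sketch of the write barrier that makes RC's cost proportional to pointer overwrites, including a saturating "sticky" count. The names and the 3-bit count field are my inventions for illustration; this is not the actual RC-Immix mechanism.

```rust
use std::cell::Cell;

const STICKY_MAX: u32 = 7; // e.g. a 3-bit count field saturates here

// Toy object header: just a reference count.
struct Obj {
    rc: Cell<u32>,
}

fn inc(o: &Obj) {
    let c = o.rc.get();
    if c < STICKY_MAX {
        o.rc.set(c + 1); // a count at STICKY_MAX is "sticky": never changed again
    }
}

// Returns true if the object's count just fell to zero (reclaimable).
fn dec(o: &Obj) -> bool {
    let c = o.rc.get();
    if c == STICKY_MAX {
        return false; // sticky objects are left to the backup mark pass
    }
    o.rc.set(c - 1);
    c == 1
}

// The barrier run on every pointer overwrite: count up the new
// referent, count down the old one. This is where RC's cost lives.
fn write_barrier<'a>(slot: &mut Option<&'a Obj>, new: Option<&'a Obj>) -> bool {
    if let Some(n) = new {
        inc(n);
    }
    let freed = slot.map_or(false, dec);
    *slot = new;
    freed
}

fn main() {
    let a = Obj { rc: Cell::new(0) };
    let b = Obj { rc: Cell::new(0) };
    let mut slot: Option<&Obj> = None;

    write_barrier(&mut slot, Some(&a)); // slot: empty -> a
    assert_eq!(a.rc.get(), 1);

    // Overwriting the slot is what releases a - promptly.
    let freed = write_barrier(&mut slot, Some(&b));
    assert!(freed);
    assert_eq!(a.rc.get(), 0);

    // A sticky count is never decremented; such an object must wait
    // for the mark pass of point 3.
    let c = Obj { rc: Cell::new(STICKY_MAX) };
    assert!(!dec(&c));
    assert_eq!(c.rc.get(), STICKY_MAX);
}
```

Note that every release here happens at an overwrite, which is exactly why the release rate is bounded above by the overwrite rate.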
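The region intuition can be sketched as follows. This is a hypothetical illustration, not a BitC design; the index-based representation is merely a stand-in for pointers that are statically confined to the region.

```rust
// A first-class region: objects are bump-allocated into the region,
// may form cycles freely, and nothing is reclaimed until the whole
// region goes away. Because intra-region references are indices into
// the region, the invariant "all pointers to these objects come from
// within region R" holds by construction.
struct Node {
    edges: Vec<usize>, // intra-region references; cycles are fine
}

struct Region {
    nodes: Vec<Node>,
}

impl Region {
    fn new() -> Region {
        Region { nodes: Vec::new() }
    }

    // Allocation is a bump: push and return the new object's index.
    fn alloc(&mut self) -> usize {
        self.nodes.push(Node { edges: Vec::new() });
        self.nodes.len() - 1
    }

    fn link(&mut self, from: usize, to: usize) {
        self.nodes[from].edges.push(to);
    }
}

fn main() {
    let mut r = Region::new();
    let a = r.alloc();
    let b = r.alloc();
    // A durable cycle: harmless, because reclamation is per-region.
    r.link(a, b);
    r.link(b, a);
    assert_eq!(r.nodes.len(), 2);
    drop(r); // the whole graph is released at once; no mark pass needed
}
```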
It still seems possible to me that we can make use of static checking to ensure that the developer's belief is correct. Having done so, it seems to me that we can exploit the result for GC purposes. For example, if we know that for some object "all pointers to this object come from within region R, except for borrowed pointers and root pointers", we have effectively told the GC that (a) these objects all belong in a common generation, and (b) for cycle detection purposes, we don't need to do a full mark pass.

*Relevance of Borrowed Pointers*

I'm still confused about how far borrowed pointers go in Rust, but the key point is that they are "invisible to the collector". To the extent that this is true, borrowed pointers *should* also be invisible for reference count maintenance purposes. Since the main source of hassle with reference counting is pointer overwrite, I think it's very important for us to understand the type rules for borrowed pointers.

I have a sneaking suspicion that the idea of "store once" pointers is very important as well. The distinction in my mind between "store once" and "init only" is that a store-once pointer can be transitioned from NULL to an object reference by a store operation after initialization has occurred. The reason this matters is that such a pointer cannot cease to exist by virtue of overwrite, which means that local copies of the value are more readily observed to be borrowed. More generally, this leads me to notice that we need to be doing local (conservative) liveness inference.

*Opportunistic Relocation*

One attraction of (most) RC schemes is that objects don't get relocated. This makes it *much* easier to interact successfully with libraries written in C/C++ - so much so that some amount of performance can and should be sacrificed in the interest of interoperability.
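A sketch of the borrowed-pointer and store-once ideas, using Rust's library reference counting (`Rc`) and `OnceCell` as stand-ins for the mechanisms under discussion: an `&`-borrow crosses a call with no count traffic at all, `Rc::clone` is where the counting happens, and a store-once slot can never release its referent by overwrite.

```rust
use std::cell::OnceCell;
use std::rc::Rc;

// A borrowed pointer: crossing this call changes no reference count.
fn length(v: &Rc<Vec<i32>>) -> usize {
    v.len()
}

fn main() {
    let v = Rc::new(vec![1, 2, 3]);
    assert_eq!(Rc::strong_count(&v), 1);

    let n = length(&v); // borrow: count untouched, no barrier runs
    assert_eq!(n, 3);
    assert_eq!(Rc::strong_count(&v), 1);

    let w = Rc::clone(&v); // counted pointer: count traffic happens here
    assert_eq!(Rc::strong_count(&v), 2);
    drop(w);

    // A "store once" slot: NULL until one post-initialization store,
    // after which it can never be overwritten - and so can never cease
    // to exist by virtue of overwrite.
    let slot: OnceCell<i32> = OnceCell::new();
    assert!(slot.get().is_none());
    assert!(slot.set(42).is_ok());
    assert!(slot.set(7).is_err()); // second store is rejected
    assert_eq!(slot.get().copied(), Some(42));
}
```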
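The interoperability point can be sketched like this. `c_sum` here is my stand-in for a real external C routine (which would be an `extern "C"` declaration); the point is that an unrelocated, counted object can be handed to foreign code by raw pointer as long as a count is held across the call.

```rust
use std::rc::Rc;

// Stand-in for a C routine: it sees only a raw pointer and a length.
fn c_sum(data: *const i32, len: usize) -> i32 {
    let mut total = 0;
    for i in 0..len {
        total += unsafe { *data.add(i) };
    }
    total
}

fn main() {
    let buf: Rc<Vec<i32>> = Rc::new(vec![1, 2, 3, 4]);

    // Holding the Rc across the call acts as the "pin": the allocation
    // stays alive and at a fixed address, so the foreign code's raw
    // pointer remains valid with no copying and no relocation.
    let total = c_sum(buf.as_ptr(), buf.len());
    assert_eq!(total, 10);
    assert_eq!(Rc::strong_count(&buf), 1); // the C side held no count
}
```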
A key reason the Immix work succeeds is that it does opportunistic relocation, which would appear (at first glance, anyway) to defeat interoperability with C/C++. The Immix approach, however, degrades gracefully in the presence of pinned objects. As long as the pins happen properly, C and C++ code can reference objects in the GC'd heap. An interesting question here becomes: could we hybridize C, sandboxing "unsafe" references and introducing a new class of "memory-safe" references? Things like this have been tried before, obviously.

*Summary*

I'm inclined, at this point, to adopt something like RC-Immix as the baseline for storage management in BitC. The one piece that I don't (yet) feel I have a handle on is whether modern RC methods degrade on weakly ordered memory subsystems, and if so, how badly.

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
