I haven't read the papers yet (I will), but my issue with reference
counting has always been concurrency. E.g.:

AtomicIncrement(&obj->refCount);

if (AtomicDecrement(&obj->refCount) == 0)
    delete obj;

The problem is that the atomics become LOCK-prefixed instructions (LOCK
XADD / LOCK CMPXCHG). Now, that is acceptable on a single core (or on a
2-CPU part with a shared L2 cache, like the Core 2), but the memory lock on
4+ core machines or multi-CPU boards makes me nervous (very), and I
remember seeing a micro-benchmark (I know...) where this cost reached 50%.
(I note the paper tested a single dual-core machine, but I haven't
deciphered the timings yet; 4- or 8-core, or better still 2/4-CPU boards,
in the tests would have been nice to see how it scales.)

Now if you can prove the code is single-threaded, it becomes a simple
increment and you lose most of that 10% cost. So it's not the ref counting
but the concurrency requirement that's the real cost, which along with
bounds-check elimination issues etc. becomes significant: probably 20% at
least from these two points. We are working around the real problem, and
that is concurrency.

The whole purpose of borrowed pointers seems to me to be to run the code in
a single-threaded task to remove this concurrency cost. Surely there are
better ways than borrowed pointers, and hence being tied to an Actor
implementation. E.g. in C# most classes are not thread-safe anyway (even
the collections use a wrapper with a lock), so why generate concurrent
code? It would be better to mark a function, class, assembly, or
scope/closure as concurrent; then any data it touches would be concurrent,
but everything else would not be. I'm sure there are better schemes still,
so there must be research on this. Also note that you could make the IR/CIL
output carry either the concurrent or the single-threaded ref counts (but I
don't think you can do this with bounds checking, as it's normally
eliminated in the JIT). A sketch of what I mean follows.
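
For instance (hypothetical; nothing like this exists in C# today): make the
counting policy a parameter, selected by the concurrency annotation, so
only code marked concurrent pays for the atomics:

    #include <atomic>

    struct PlainCount {                 // single-threaded policy
        int n = 1;
        void inc() { ++n; }
        bool dec() { return --n == 0; }
    };

    struct AtomicCount {                // concurrent policy
        std::atomic<int> n{1};
        void inc() { n.fetch_add(1, std::memory_order_relaxed); }
        bool dec() { return n.fetch_sub(1, std::memory_order_acq_rel) == 1; }
    };

    // Ref<T, PlainCount> inside single-threaded scopes,
    // Ref<T, AtomicCount> inside scopes marked concurrent.
    template <typename T, typename Policy>
    struct Ref {
        struct Box { Policy count; T value; };
        Box* box;

        explicit Ref(T v) : box(new Box{{}, v}) {}
        Ref(const Ref& r) : box(r.box) { box->count.inc(); }
        ~Ref() { if (box->count.dec()) delete box; }
        Ref& operator=(const Ref&) = delete;   // kept minimal for the sketch
    };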


Ben


On Sat, Oct 12, 2013 at 1:48 AM, Jonathan S. Shapiro <[email protected]> wrote:

> I owe a bunch of people here an apology. There is a *lot* to know in this
> space. Sometimes it gets overwhelming, I get stuck, and I start refusing
> input. This can lead me to be pretty stupid. The only saving grace is that
> I eventually get embarrassed enough to catch up on the literature and
> un-stick myself.
>
> We've had an ongoing discussion about garbage collection (GC) here,
> notably focusing on real-time requirements. There has been a side
> discussion on reference counting (RC), and a second side discussion on
> regions. I'm afraid I rejected RC on the basis of its notoriously bad
> performance, mainly because I didn't know about the RC-immix work. I think
> I need to start this discussion over now that I'm listening. Let me begin
> by trying to characterize what I now think I know. I strongly encourage
> that everybody read the paper on Immix
> <http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf>,
> and then the paper on RC-immix
> <http://research.microsoft.com/pubs/202163/rcix-oopsla-2013.pdf>, in that
> order. I found the RC-immix paper amusing, because when I read their
> earlier paper on reference counting
> <http://users.cecs.anu.edu.au/~steveb/downloads/pdf/rc-ismm-2012.pdf>, my
> first thought was that they had to be eating 10% or so in arena
> management. Sure enough, that was correct. But an additional reason to
> read the two immix papers is that they lay out a consistent terminology
> for discussion.
>
> There are three issues to consider in any storage allocation/reclamation
> approach: total CPU overhead, total memory overhead (measured in required
> real pages, not heap size) and pause times. For our purposes, CPU overhead
> can reasonably be approximated by analysis of L1 and L2 cache misses. TLB
> misses are also important, and the literature I have seen doesn't seem to
> examine that number. The TLB is generally a *much* smaller cache than the
> I- or D-cache, and the cost of a TLB miss is often higher than that of an
> I- or D-cache miss.
>
> *Overheads*
>
> Let me start with a summary of what has been reloaded into my head at this
> point:
>
> 1. Any scheme that uses a *k*-sized free list mechanism naively is
> subject to both immediate overhead and fragmentation. The overhead arises
> from the need to manage the size-based free lists. Not so much the lists
> themselves as the policy decisions about coalescing those lists. The
> fragmentation arises because operations that free objects do not obey
> allocation order. This causes the arena's locality to become progressively
> less aligned with allocation order. Absent dynamic feedback, the best
> predictor we have for access locality is still allocation order. The
> elimination of a k-sized free list mechanism is the main advantage of
> compacting and evacuating collectors.
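>
> (To illustrate the naive scheme, a hypothetical sketch; note that free()
> pushes blocks back in release order, which is exactly what drifts the
> arena away from allocation order:)
>
>     #include <cstddef>
>
>     struct FreeBlock { FreeBlock* next; };
>
>     struct SizeClassAllocator {
>         static constexpr std::size_t kClasses = 8;   // 16B .. 2KB, powers of 2
>         FreeBlock* freeList[kClasses] = {};
>
>         static std::size_t classOf(std::size_t n) {  // assumes n <= 2048
>             std::size_t c = 0;
>             for (std::size_t sz = 16; sz < n; sz <<= 1) ++c;
>             return c;
>         }
>         void* alloc(std::size_t n) {
>             std::size_t c = classOf(n);
>             if (FreeBlock* b = freeList[c]) {        // LIFO reuse: release
>                 freeList[c] = b->next;               // order, not allocation
>                 return b;                            // order
>             }
>             return ::operator new(std::size_t{16} << c);
>         }
>         void free(void* p, std::size_t n) {
>             auto* b = static_cast<FreeBlock*>(p);
>             std::size_t c = classOf(n);
>             b->next = freeList[c];
>             freeList[c] = b;
>         }
>     };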
>
> 2. The object release rate is bounded (from above) by the number of
> pointer overwrites. This is true because *some* pointer overwrites cause
> objects to become unreferenced. Unfortunately this is a very *loose* bound.
> There is very compelling data that few old objects are ever
> released, and that *when* released, they are released in bulk. The cost
> of an old-space mark pass is tightly proportional to the size of the
> old-space heap. The cost of reference counting is tightly proportional to
> the rate of pointer overwrites.
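>
> (For concreteness, the per-overwrite cost looks like the classic RC write
> barrier - a sketch, not RC-immix's actual barrier; rcRetain/rcRelease are
> assumed count-adjust helpers:)
>
>     struct Obj;
>     void rcRetain(Obj* o);       // assumed: increments o's count
>     void rcRelease(Obj* o);      // assumed: decrements, frees at zero
>
>     // Every pointer overwrite pays one increment plus one decrement,
>     // which is why RC cost tracks the pointer-overwrite rate.
>     void storePointer(Obj** slot, Obj* newVal) {
>         if (newVal) rcRetain(newVal);    // count the new reference first
>         Obj* old = *slot;
>         *slot = newVal;
>         if (old) rcRelease(old);         // overwrite may free the old target
>     }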
>
> 3. *All* GC and RC systems require some form of mark pass that is linear
> in the size of the heap. GC systems require this as their baseline
> mechanism. RC systems require it for cycle collection (yes, I know about
> trial unreference techniques - I'm trying to stick to practical options
> here). In RC systems, the expected yield of the mark pass is much lower
> than in GC systems, because the majority of releases are accomplished
> promptly by RC.
>
> The *cost* of the mark (and possibly sweep) pass comes in multiple forms:
> CPU cost, cache pollution, power consumption, and (in some cases) delay of
> the mutator thread(s).
>
> 4. The pause times associated with collections in new space are small, and
> mainly tolerable. Locality concerns are much reduced in new space, mainly
> because the entirety of new space tends to fit in L1 cache. There are
> several sensible algorithms for new space, some of which are concurrent.
> While we need to choose a (family of) algorithm(s) with care, this does not
> seem to be the thing that is really in the way of real-time subsystems. I
> suspect it is useful for the application to be able to say, explicitly,
> "this is a long-lived allocation".
>
> 5. In systems that have any sort of real-time requirement, concurrent
> collection is really the only game in town.
>
> 6. The RC-immix work has brought reference counting back into the realm
> where it can reasonably be considered. The big win is that it reclaims much
> of the locality that has previously been achieved only in
> compacting/evacuating collectors.
>
> 7. In GC systems, there is a potentially long delay between the time an
> object is unreachable and the time its storage can be reused. This delay
> becomes progressively higher as the generation gets older. In RC systems,
> this delay is present for *cyclic* objects, and for the small proportion
> of objects whose reference count reaches a so-called "sticky" count.
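>
> (By "sticky" I mean a count that saturates; a sketch with a small header
> field - illustrative, not RC-immix's actual header layout:)
>
>     #include <cstdint>
>
>     struct Header {
>         static constexpr std::uint8_t kSticky = 7;  // 3-bit count sticks at 7
>         std::uint8_t count : 3;
>
>         void retain() {
>             if (count < kSticky) ++count;           // saturate at the maximum
>         }
>         bool release() {                            // true => free immediately
>             if (count == kSticky) return false;     // stuck: left for the
>             return --count == 0;                    // backup mark pass
>         }
>     };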
>
> 8. *Every GC system has a bad case*. Even for purely concurrent GC's,
> there is a relationship between CPU overhead, heap size, release rate, and
> allocation rate. If the behavior of the application goes "outside the
> zone", real-time guarantees are lost. I don't see this as bad, because in
> the real world *all* real-time guarantees are a function of available
> resource and tolerable cost. The problem I have with this is that the
> envelope is rarely characterized well enough to do any good.
>
> 9. An under-noted advantage of RC is that it does not (or at least
> *should not*) incur the real-DRAM overheads of GC. This is a very important
> thing.
>
> *Relevance of Regions*
>
> I still have an intuition that regions are relevant - or at least, *first
> class* regions are relevant. It is often true that developers *know* when
> they are allocating cyclic structures, and that the cyclic nature of the
> structure is durable. That is: once attached to the graph, objects aren't
> released until the whole graph is released. It still seems possible to me
> that we can make use of static checking to ensure that the developer's
> belief is correct. Having done so, it seems to me that we can exploit the
> result for GC purposes. For example, if we know about some object "all
> pointers to this object come from within region R, except for borrowed
> pointers and root pointers", we have effectively told the GC that (a) these
> objects all belong in a common generation, and (b) for cycle detection
> purposes, we don't need to do a full mark pass.
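>
> (Roughly the shape I have in mind - a hypothetical first-class region
> whose members may form cycles freely, because nothing is reclaimed until
> the whole region is released in bulk:)
>
>     #include <cstddef>
>     #include <cstdlib>
>     #include <vector>
>
>     struct Region {
>         std::vector<void*> chunks;   // everything the region owns
>
>         void* alloc(std::size_t n) {
>             void* p = std::malloc(n);
>             chunks.push_back(p);
>             return p;
>         }
>         ~Region() {                  // bulk release: the only "collection"
>             for (void* p : chunks)   // an intra-region cycle ever needs
>                 std::free(p);
>         }
>     };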
>
> *Relevance of Borrowed Pointers*
>
> I'm still confused about how far borrowed pointers go in Rust, but the key
> point is that they are "invisible to the collector". To the extent that
> this is true, borrowed pointers *should* also be invisible for reference
> count maintenance purposes. Since the main source of hassle with reference
> counting is pointer overwrite, I think it's very important for us to
> understand the type rules for borrowed pointers.
>
> I have a sneaking suspicion that the idea of "store once" pointers is very
> important as well. The distinction in my mind between "store once" and
> "init only" is that a store-once pointer can be transitioned from NULL to
> an object reference by a store operation after initialization has occurred.
> The reason this matters is that such a pointer cannot cease to exist by
> virtue of overwrite, which means that local copies of the value are more
> readily observed to be borrowed.
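>
> (A sketch of the store-once cell I mean - hypothetical, not any existing
> Rust feature: NULL until a single successful store, never overwritten
> afterwards, so copies of the value can't be invalidated by a store:)
>
>     #include <atomic>
>
>     template <typename T>
>     struct StoreOnce {
>         std::atomic<T*> p{nullptr};
>
>         bool set(T* v) {             // succeeds at most once
>             T* expected = nullptr;
>             return p.compare_exchange_strong(expected, v,
>                                              std::memory_order_acq_rel);
>         }
>         T* get() const { return p.load(std::memory_order_acquire); }
>     };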
>
> More generally this leads me to notice that we need to be doing local
> (conservative) liveness inference.
>
> *Opportunistic Relocation*
>
> One attraction to (most) RC schemes is that objects don't get relocated.
> This makes it *much* easier to interact successfully with libraries
> written in C/C++. So much so that some amount of performance can and should
> be sacrificed in the interest of interoperability.
>
> A key reason the immix work succeeds is that it does opportunistic
> relocation, which would appear (at first glance, anyway) to defeat
> interoperability with C/C++. The immix approach, however, degrades
> gracefully in the presence of pinned objects. As long as the pins happen
> properly, C and C++ code can reference objects in the GC heap.
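>
> (The pinning discipline, sketched - hypothetical, not immix's actual
> mechanism: a scoped pin that forbids relocation while C/C++ code holds
> the raw address:)
>
>     struct GcObject { int pinCount = 0; /* payload... */ };
>
>     struct Pin {                     // RAII: pinned for the Pin's lifetime
>         GcObject* obj;
>         explicit Pin(GcObject* o) : obj(o) { ++obj->pinCount; }
>         ~Pin() { --obj->pinCount; }  // relocatable again at zero
>         Pin(const Pin&) = delete;
>         Pin& operator=(const Pin&) = delete;
>     };
>
>     // The collector would consult this before evacuating an object:
>     // bool canRelocate(const GcObject* o) { return o->pinCount == 0; }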
>
> An interesting question here becomes: could we hybridize C to sandbox
> "unsafe" references and introduce a new class of "memory safe" references?
> Things like this have been tried before, obviously.
>
>
>
> *Summary*
>
> I'm inclined, at this point, to adopt something like RC-immix as the
> baseline for storage management in BitC.  The one piece that I don't (yet)
> feel I have a handle on is whether modern RC methods degrade, and if so how
> badly, on weakly ordered memory subsystems.
>
>
> shap
>
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
