Oh, wait.

This isn't about improving branch prediction. This is about eliminating
read barriers outright when conditions do not require them: the inlined
portion of the barrier goes away *entirely*. This reduces dynamic
instruction counts by 28% to 36% compared to code that contains read
barrier checks, and eliminates a large number of hard-to-schedule
data-dependent loads. The overhead of the translation itself is offset by
the fact that the translation process performs trace linearization as a
side effect. With a little care, it can also keep track of which
registers currently hold object references.
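
To make that concrete, here is roughly the per-load check that gets
deleted (a sketch only; the header-tag encoding of forwarding below is
an assumption, not the actual design):

    #include <stdint.h>

    typedef struct obj {
        uintptr_t header;            /* low bit doubles as a forwarded tag */
    } obj_t;

    #define FORWARDED ((uintptr_t)1)

    /* With the barrier compiled in, every reference load pays a test, a
     * conditional branch, and (when forwarding is live) a data-dependent
     * load through the header. */
    static obj_t *load_ref_barriered(obj_t **slot) {
        obj_t *ref = *slot;
        if (ref && (ref->header & FORWARDED))
            ref = (obj_t *)(ref->header & ~FORWARDED);
        return ref;
    }

    /* What the translator leaves behind when no evacuation is in
     * progress: just the load. */
    static obj_t *load_ref_plain(obj_t **slot) {
        return *slot;
    }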


shap


On Sun, Oct 20, 2013 at 6:04 PM, Sandro Magi <[email protected]> wrote:

> Re: dynamic translation for read barriers, my initial impression is
> skepticism that the improved branch prediction from a dynamically swapped
> read barrier would outweigh the cost of i-cache flushes. Has this sort of design
> actually been measured?
>
> Sandro
>
>
>
> "Jonathan S. Shapiro" <[email protected]> wrote:
>
> OK. It's time for me to step back from worrying about GC design and get
> back to actual work. I've internalized a bunch of stuff and I've managed to
> catch up on the last 15 years or so of the literature (almost), and I've
> gotten a lot from all of you in the process. I'm going to send Tony and
> Eliot a note arising from this discussion, and try to encourage a more
> focused discussion of read barriers, forwarding, and concurrent design
> tradeoffs in the next edition of *The Garbage Collection Handbook*.
> For now, let me try to run down where I am.
>
> I'm very impressed by the current state of RC-collectors, and I've talked
> elsewhere about how this approach opens up a bunch of design options and
> expressiveness options. The key to variable-weight and/or
> performance-isolated collection is to introduce [possibly dynamic] heap
> membranes, and keep track of the object references that cross membrane
> boundaries. Other techniques can do it, but hybrid-RC designs seem to do it
> without introducing multiplication of mechanism. Especially so when some
> variation of the immix heap organization is adopted.
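>
> As a sketch of what I mean by tracking membrane-crossing references (the
> region-id encoding and the log representation here are placeholders, not
> a worked design):
>
>    #include <stddef.h>
>
>    typedef struct obj {
>        int region;                  /* which membrane the object lives in */
>    } obj_t;
>
>    #define XLOG_CAP 4096
>    static obj_t **xlog[XLOG_CAP];   /* slots holding cross-membrane refs */
>    static size_t  xlog_len;
>
>    /* Write barrier: record each stored reference that crosses a membrane
>     * boundary, so a region can be collected against a small external
>     * root set instead of a whole-heap scan.  Overflow handling elided. */
>    static void store_ref(obj_t *holder, obj_t **slot, obj_t *target)
>    {
>        *slot = target;
>        if (target && holder->region != target->region && xlog_len < XLOG_CAP)
>            xlog[xlog_len++] = slot;
>    }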
>
> We've talked a lot about GC performance, and I've come to feel that we are
> using the wrong metrics. Safe language throughput seems to be a mostly
> solved problem at this point, by which I mean both that (a) the difference
> relative to explicit storage management is now very small, and (b) any
> inherent cost of safety is now low enough that I'm willing to pay it, and I
> think a lot of other people are too. The question of pause times is a
> different matter, and I've become convinced that by focusing on pause times
> we have been framing the problem the wrong way. I think we should be
> focusing on three variables:
>
>    - A time interval (T). Moving closer to hard real-time applications
>    may require shorter intervals.
>    - Mutator *progress* during each time interval (P). Not just "what
>    percentage of time is spent in the mutator" but also "How much real-world
>    computational progress is it making?" Spending 10% more time in the mutator
>    while adding 12% overhead to the mutator isn't winning (see the worked
>    example after this list).
>    - *Variance* of that progress over that time interval (V)
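>
> To make the P point concrete: treating progress as (share of wall time in
> the mutator) times (useful work per unit of mutator time), trading 10%
> more mutator time for 12% more per-operation overhead gives
>
>    P = 1.10 / 1.12 ~ 0.982
>
> i.e. about 2% *less* real progress despite the better-looking mutator
> percentage.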
>
> We would like flexibility in choosing T, possibly at some cost in the rate
> of progress P, the variance of progress V, or both. Unfortunately our
> collection needs are also bounded by allocation rate and mutation rate (e.g.
> through logging), so this isn't a complete characterization. If we can
> arrive at a solution that makes these things somewhat tunable, I don't
> think we should be worrying about the last 2% of throughput performance.
>
> On parallel (as opposed to concurrent) collectors we are now looking at
> <50ms pause times on small-scale multiprocessors, with 10% or less (typical) of
> total time spent in storage management [*]. While we incur both code bloat
> and execution overhead in barrier code, that is made up (relative to
> conventional allocators) by reduced L1 cache miss rates. The code bloat is
> *also* made up (usually) by decreases in the overhead information that
> must be recorded on each allocation. I've been talking, in the background,
> to people I know who build production video games and codecs. Faced with
> the prospect of a 50ms pause, they mostly didn't care. Several of them
> noted that current context switch and page fault times often exceed this
> already, though the "all threads" aspect of the pause may have implications
> they weren't considering.  *It is not obvious that on-the-fly is better
> for real-time applications, nor is it obvious that it is required for
> them.*
>
> [*] The exception is applications that bulk allocate and bulk deallocate
> durable structures, Xalan being a notable example. I believe we all agree
> that there are cases where the programmer knows what their needs are, and
> the allocator could cooperate with those needs if a means existed to
> express them. This won't solve all cases, but it will solve a lot of them.
>
>
> Almost all of the mutator performance in these discussions hinges on read
> barriers: do we need them, and if so, do they need to be fine-grained (i.e.
> inlined with the load instructions) or can they be done with page
> protections? Here are the answers I have come to (so far):
>
>    1. Read barriers of some form are required on any sub-heap where
>    collectors and mutators may operate concurrently and evacuation/relocation
>    may be performed. This means that it is *extremely* desirable to use
>    thread-local nursery allocation (see the sketch after this list).
>    2. Fine-grain read barriers are *sometimes* required if the mutator
>    participates cooperatively in the maintenance of the tri-color (or K-color)
>    invariant. It depends on the type of mutator.
>    3. Fine-grain read barriers are required if heap regions that are
>    subject to recoloring and/or evacuation are "reference hot", because the
>    page fault approach isn't cost effective.
>    4. Fine-grain read barriers are required in practice if nurseries are
>    not thread-local.
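>
> (A sketch of the thread-local nursery point from item 1: a per-thread
> bump allocator whose contents no other thread can see between
> safepoints, so loads from it need no read barrier.  The alignment and
> refill details are placeholders.)
>
>    #include <stddef.h>
>
>    typedef struct nursery {
>        char *cursor;                /* next free byte */
>        char *limit;                 /* end of this thread's chunk */
>    } nursery_t;
>
>    static _Thread_local nursery_t nursery;
>
>    static void *nursery_alloc(size_t size)
>    {
>        size = (size + 15) & ~(size_t)15;    /* 16-byte alignment */
>        char *p = nursery.cursor;
>        if ((size_t)(nursery.limit - p) < size)
>            return NULL;                     /* refill/GC slow path elided */
>        nursery.cursor = p + size;
>        return p;
>    }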
>
> Taken in combination, these conditions mean that read barriers are needed
> for any on-the-fly collection scheme, and also for any fully concurrent (as
> opposed to *mostly* concurrent) scheme.
>
> I've noted in discussion that there are ways to emit code in which read
> barriers can be turned on and off dynamically through low-level dynamic
> translation. This prospect gives us a late-binding option on the choice of
> collector integration strategy, which strikes me as elegant. It also means
> that we can separate many of the GC integration issues from the compiler
> entirely. We can instead make our choices late on the basis of how hot the
> forwarded objects are, what proportion of mutator execution time is
> overlapped with collector execution, the dynamic frequency and dynamic
> locality with which the conditions that *trigger* read barriers arise,
> and our target objectives for P, V, and T.
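>
> The crudest stand-in I know for that late binding (cruder than real
> dynamic translation, which removes the site entirely, and it does pay
> Sandro's i-cache cost) is to make each barrier site patchable.  x86-64
> assumed; page-straddling and cross-modifying-code details deliberately
> ignored:
>
>    #include <stdint.h>
>    #include <string.h>
>    #include <sys/mman.h>
>
>    /* Rewrite a 5-byte barrier site: "call barrier" while evacuation is
>     * live, a nop sled when it is not. */
>    static void patch_site(uint8_t *site, int enable, void (*barrier)(void))
>    {
>        uintptr_t page = (uintptr_t)site & ~(uintptr_t)0xfff;
>        mprotect((void *)page, 4096, PROT_READ | PROT_WRITE | PROT_EXEC);
>        if (enable) {
>            int32_t rel = (int32_t)((uint8_t *)barrier - (site + 5));
>            site[0] = 0xE8;                  /* call rel32 */
>            memcpy(site + 1, &rel, 4);
>        } else {
>            memset(site, 0x90, 5);           /* nops */
>        }
>    }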
>
> I have been focusing a lot on relocating collectors, mainly because
> evacuation of some form seems to be a necessary activity for heap
> defragmentation. It isn't clear that this is really required. In the immix
> design, for example, we're already prepared to do chunked bump allocation.
> It is conceivable that a block could be considered "free" if it is "free
> enough" without being totally cleared. If that is possible, then the need
> for fine-grain read barriers goes away. It is also possible that we could
> size-segregate the blocks in the tenured heap to control fragmentation. It
> is also possible that inter-operation with C creates enough dynamic
> restrictions on relocation (through pinning) that relocation isn't that
> helpful. The pressure to relocate is a function of total heap size
> pressure, which is inversely related to available system DRAM. An
> unfortunate implication of this is that small platforms are more likely to
> use read barriers (with their associated code bloat) than large platforms.
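>
> To sketch the "free enough" idea in immix terms (32 KB blocks of
> 128-byte lines are immix's numbers; the threshold test is my invention):
>
>    #include <stdbool.h>
>    #include <stdint.h>
>
>    #define LINES_PER_BLOCK 256      /* 32 KB block / 128-byte lines */
>
>    typedef struct block {
>        uint8_t line_marked[LINES_PER_BLOCK];   /* set during marking */
>    } block_t;
>
>    /* Reuse a block without evacuating survivors if enough whole lines
>     * are free; allocation then bumps through the free holes. */
>    static bool free_enough(const block_t *b, int threshold)
>    {
>        int free_lines = 0;
>        for (int i = 0; i < LINES_PER_BLOCK; i++)
>            if (!b->line_marked[i])
>                free_lines++;
>        return free_lines >= threshold;
>    }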
>
> The practical benefit of on-the-fly collection (as opposed to concurrent
> collection) seems uncertain. When all's done, there are only so many
> hardware threads of execution available, and the activity of these hardware
> threads is going to go back and forth between mutating and collecting. The
> "flip" can be implemented voluntarily in the mutator or implicitly in the
> operating system. The cost of doing it implicitly is (a) need for a
> fine-grain read barrier, (b) loss of mutator control over when collection
> happens, and (c) lost ability to drive collection in finer-grain
> increments. The primary *advantage* of on-the-fly is that it allows
> collection to proceed during long I/O waits.
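>
> (The voluntary flip is just a safepoint poll.  A sketch, with the
> collector entry point hypothetical:)
>
>    #include <stdatomic.h>
>
>    extern void gc_run_increment(void);      /* hypothetical */
>
>    static atomic_int gc_requested;
>
>    /* The mutator lends its hardware thread to the collector at a point
>     * of its own choosing, rather than being preempted behind a page
>     * fault or a fine-grain barrier. */
>    static inline void safepoint(void)
>    {
>        if (atomic_load_explicit(&gc_requested, memory_order_acquire))
>            gc_run_increment();
>    }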
>
> Once a fine-grain read barrier is required, it is worth considering how
> many of the checks previously performed by write barriers can instead be
> moved to read barriers. It is also worth considering a notion of a
> "read-write barrier". That is: combining the read and write barriers when
> we are reading a reference to an object that we know we intend to write. I
> haven't spent much time on either question; the high-order concern was to
> determine whether a fine-grain read barrier was required at all.
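>
> A sketch of what I mean by the combined barrier (the header-tag encoding
> and the dirty log are placeholders):
>
>    #include <stdint.h>
>
>    typedef struct obj { uintptr_t header; } obj_t;
>
>    #define FORWARDED ((uintptr_t)1)
>    #define LOGGED    ((uintptr_t)2)
>
>    extern void remember(obj_t *ref);        /* hypothetical dirty log */
>
>    /* For loads whose result we know we will mutate: resolve forwarding
>     * (the read half) and log the object once (the write half) in a
>     * single check sequence. */
>    static obj_t *load_for_write(obj_t **slot)
>    {
>        obj_t *ref = *slot;
>        if (ref && (ref->header & FORWARDED))
>            ref = (obj_t *)(ref->header & ~(uintptr_t)3);  /* strip tags */
>        if (ref && !(ref->header & LOGGED)) {
>            remember(ref);
>            ref->header |= LOGGED;
>        }
>        return ref;
>    }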
>
> Once a fine-grain read barrier is required, you have in some respects
> "paid the piper", and you might as well get all of the advantage out of it
> that you can.
>
> Last, but not least, I note that one of the challenges in deferred
> reference counting is that it makes on-the-fly collection a bit tricky. The
> essence of it is that something analogous to a tri-color invariant needs to
> be maintained on the deferred decrement log so that mutators can go through
> a "phase agreement" with the collector. It could be that this is done using
> the tri-color invariant itself.
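>
> (One shape that phase agreement might take: an epoch tag per deferred
> decrement, so the collector only applies entries from epochs every
> mutator has left.  Entirely a sketch; the representation is invented.)
>
>    #include <stdatomic.h>
>    #include <stddef.h>
>
>    typedef struct dec_entry {
>        void    *obj;
>        unsigned epoch;              /* collector phase when logged */
>    } dec_entry_t;
>
>    #define DEC_LOG_CAP 8192
>    static dec_entry_t dec_log[DEC_LOG_CAP];
>    static size_t      dec_len;
>    static atomic_uint global_epoch;
>
>    static void defer_decrement(void *obj)
>    {
>        if (dec_len < DEC_LOG_CAP) {         /* overflow path elided */
>            dec_log[dec_len].obj   = obj;
>            dec_log[dec_len].epoch = atomic_load(&global_epoch);
>            dec_len++;
>        }
>    }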
>
>
> shap
>
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
