So I agree that Go focusing on low latency is the way to go.  I'm not sold
that they can make it so low cost that, for example, a high-performance
database built in Go would have comparable latency and throughput to one
built in a non-GC'd language/runtime.  All the "little/insignificant"
inefficiencies you brush away as being virtually zero/undetectable/etc add
up.  Heck, we've talked about the costs of context (or mode) switches on
this thread before, and how they can be dominated by the caches needing to
be warmed up (vs register/context being saved/restored).  But you deem that
type of cost negligible.  I'll agree that it's negligible compared to 100s
of millis (or worse - multiple secs or tens of secs) GC pauses.  It's
definitely not negligible for the lower level systems being built.

I'm curious about this bit:

> Yes, the C4 LVB barrier is one such design example. It amounts to adding a
> test+jmp sequence in the reference load path [no memory access,
> virtually-always-predictable, devolves to a single u-op on x86]. While
> there is some "cost" to this extra u-op, its throughput impact is
> insignificant when compared to other things (e.g. a typical 3-4x reduction
> in newgen frequency). It is already significantly cheaper than the extra
> write barrier work done in G1 for remembered set tracking for STW
> compaction, for example.

So there's a test+jmp inserted on every reference load, which is pretty
standard and common fare in Java code.  But how is it "no memory access"?
IIRC, when we last discussed the intricacies, an LVB would check whether a
phase change occurred, which I think is a memory read, isn't it? Or by "no
memory access" do you mean it should still be valid in the cache? Also, I'm
assuming that once inlining stops, the LVB/phase change information needs
to be regathered by callees that were not inlined? So in other words, this
will devolve into another prologue (beyond call frame setup)? I can't see
how this won't impact the i-cache or pollute the BTB history.

On Thu, Dec 22, 2016 at 3:58 PM, Gil Tene <g...@azul.com> wrote:

>
>
> On Thursday, December 22, 2016 at 10:46:47 AM UTC-8, Vitaly Davidovich
> wrote:
>>
>>
>>
>> On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> wrote:
>>
>>>
>>>
>>> On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich
>>> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
>>>>
>>>>> Go's GC story is evolving. And it's doing so at a fairly quick pace.
>>>>> Judging it based on a current implementation and current choices is
>>>>> premature, I think.
>>>>>
>>>> Doesn't seem premature at all to me.  Current implementation is what
>>>> runs in production, what people will experience, and what they should
>>>> expect/write their code towards.  This is not too dissimilar from the
>>>> Sufficiently Smart Compiler - it *could* do something in *theory*, but
>>>> practical/real implementations don't.  That's not to say their story won't
>>>> improve over time, but judging current impl is pretty fair IMO.
>>>>
>>>
>>> Looking at the current state of implementation to judge current use is
>>> fine. But projecting from the current state to Go as a whole and assuming
>>> inherent qualities (something that I see a lot of) would be similar to
>>> concluding that "Java is slow because it is an interpreted language" and
>>> basing those conclusions on 1997 JVMs.
>>>
>>  Sure.
>>
>>>
>>>
>>>>> I think that the Go team's *implementation priority* focus on latency
>>>>> is a good one. And unlike others, I don't actually think there is a strong
>>>>> latency vs. throughput tradeoff in the long run. Achieving great latency
>>>>> in GC does not inherently come at the expense of GC throughput or
>>>>> application throughput. Both can be had at the same time. My view is that
>>>>> GC implementations in general (for all runtimes) will move towards good
>>>>> latency + good throughput at the same time as they mature. And Go is no
>>>>> different here.
>>>>>
>>>> I don't see how minimizing latency cannot impact application
>>>> throughput.  The impact can be minimized, but I don't see how it can be
>>>> (nearly) zero.  The impact will vary from app to app, depending on how much
>>>> it intrinsically can make use of CPU before bottlenecking elsewhere and
>>>> giving up CPU for reasons other than GC.
>>>>
>>>
>>> There are tiny tradeoffs involved, but they are "in the noise", and are
>>> usually immeasurable in practice. So yes, I'm saying that it is "nearly
>>> zero". Concurrency and latency are not inherently coupled with efficiency
>>> choices in GC work. In practice, they tend to have very little to do with
>>> each other.
>>>
>> They're necessarily coupled IMO.  If a CPU is occupied with GC work that
>> could otherwise be used by the user app, it's a throughput hit.
>>
>
> You may be confusing what you know about the details of some "somewhat
> concurrent, but not designed for simultaneous concurrency and efficiency"
> implementations with what is inherent to performing concurrent GC work.
>
> The bulk (as in the vast vast vast majority) of overall rate of CPU
> consumed by GC work is an inherent result of allocation rate (which is a
> unit of throughput). How much of that work you do per allocation unit is
> NOT strongly affected by concurrency and interleaving choices in GC
> algorithms. I listed examples of dramatically more dominant algorithmic
> choices that have nothing to do with concurrency or incremental ability.
>
>
>> If extra instructions are executed on behalf of GC while also performing
>> user work, it's a loss in throughput (e.g. write barriers, indirection
>> barriers that may hijack execution, etc).
>>
>
> Again, most of that "extra instruction" cost is driven not by concurrency
> or incremental ability, but by other large-throughput-affecting choices.
> E.g. the need for a write barrier is usually dictated by generational
> collection (STW or not doesn't matter), not by concurrency. And adding a
> write barrier (to achieve generational collection) is a proven way to
> dramatically increase GC efficiency.
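>
> To make that concrete, here is a minimal sketch of a classic generational
> card-marking write barrier (illustrative Go; the names, card size, and
> layout are made up, not any particular runtime's implementation):
>
>     package gcsketch
>
>     import "unsafe"
>
>     const cardShift = 9 // 512-byte cards; a common choice
>
>     var (
>         heapBase  uintptr // start of the managed heap (assumed)
>         cardTable []byte  // one byte per card; 0 = clean, 1 = dirty
>     )
>
>     // writeRef stores child into *slot and dirties the card covering
>     // slot, so a newgen collection can find old->new pointers by
>     // scanning dirty cards instead of the entire old generation.
>     func writeRef(slot *uintptr, child uintptr) {
>         *slot = child // the application store itself
>         card := (uintptr(unsafe.Pointer(slot)) - heapBase) >> cardShift
>         cardTable[card] = 1 // the extra work: one shift, one byte store
>     }
>
> Note that this cost buys generational filtering, not concurrency: it is
> paid whether the collector is STW or concurrent.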
>
>
>>   I'm not even talking about the possible d-cache and i-cache pollution,
>> cache misses, registers reserved to support GC, etc that may be involved.
>>
>
> The impact of these is so tiny that it is hard to measure. Certainly a
> few orders of magnitude smaller than e.g. generational collection yes/no
> decisions.
>
>> Since GC cannot know the allocation/memory pattern perfectly, it'll
>> ultimately perform more CPU work than could otherwise be necessary.  This
>> is the price of admission for this type of automatic management.
>>
>
> When a concurrent GC's triggering and throttling mechanisms successfully
> get to 98%+ occupancy on a regular basis, this perceived cost fades away. And
> it is easy and common to see that level of full use of the heap resources
> [yes, "70% occupancy trigger" is a silly thing, but it is in no way
> inherent or needed, it's just an artifact of some implementations]. Even
> where things are more conservative (GC kicks off earlier than it should),
> the impact of the extra time spent tends to be significantly smaller than
> that of other choices, and is often offset by the easier choices that
> concurrency allows.
>
> To use a practical example: a concurrent generational collector can easily
> dispense with the artificial line between oldgen and newgen heaps that many
> STW collectors employ, because when neither generation incurs significant
> pauses, the original reason for having the line there is gone (when oldgen
> STW was significant, exhausting the oldgen early was unacceptable). This
> allows the newgen to always use the entire available heap, reducing newgen
> collection frequency by a significant factor, which eliminates a
> significant amount of GC work (with the same application throughput in the
> same heap size). This results in LESS work for GC, and higher throughput.
> This increased efficiency is more an artifact of GC algorithm choice (share
> the heap between old and newgen rather than segregate it) than a result of
> concurrency. Concurrency simply enables that choice to be made much more
> easily.
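>
> To put illustrative numbers on that (made up for the arithmetic, not
> measurements): newgen collection frequency is roughly the allocation rate
> divided by the space newgen may fill before a collection, so letting
> newgen use all of the empty heap divides the frequency directly:
>
>     package main
>
>     import "fmt"
>
>     func main() {
>         allocRate := 2.0    // GB/sec allocated by the app (assumed)
>         fixedNewgen := 2.0  // GB: newgen carved out of a 10 GB heap
>         sharedNewgen := 8.0 // GB: newgen free to use all empty heap
>
>         // newgen collections per second = allocation rate / newgen space
>         fmt.Printf("segregated: %.2f GCs/sec\n", allocRate/fixedNewgen)  // 1.00
>         fmt.Printf("shared:     %.2f GCs/sec\n", allocRate/sharedNewgen) // 0.25: 4x fewer
>     }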
>
>> There's nothing inherently wrong with it if that cost is acceptable.
>> Likewise, there's a development/mental cost to manual memory management as
>> well.  It's all about priorities and cost.
>>
>
> A virtually-zero additional cost in CPU is acceptable by definition.
>
>
>>> The fundamentals are simple:
>>>
>>> For a given live set, heap set, and rate of allocation, a GC algorithm
>>> has a certain amount of work to do. It needs to locate the dead material
>>> and reclaim it such that it can be allocated into again. The choice of
>>> algorithms used affects the amount of work needed to achieve this. But this
>>> amount of work tends to be only minutely affected by concurrency concerns,
>>> and is dramatically affected by other choices. E.g. the choice to use
>>> compacting collectors dramatically affects the cost of allocation (bump
>>> pointer allocation is dramatically cheaper than allocation in
>>> non-contiguously empty regions). E.g. the choice of using a generational
>>> filter (or not) has several orders of magnitude higher impact on GC
>>> throughput than the choice to do concurrent or tiny-incremental-STW work
>>> (or not). The choice to use evacuation as a compaction technique (common in
>>> copying and regional collectors, but not in classic MarkSweepCompact) can
>>> dramatically affect efficiency (in two opposite directions) and depends a
>>> lot more on the amount of empty heap (aka "heap headroom") available than
>>> on concurrency or incremental STW design choices.
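>>>
>>> As a concrete sketch of the bump-pointer point (illustrative Go, not any
>>> runtime's actual allocator): with a contiguously empty region, the
>>> allocation fast path is one add and one compare, where a free-list
>>> allocator would have to search and split blocks:
>>>
>>>     // bumpRegion allocates from a contiguous empty region by bumping
>>>     // a pointer; compaction is what keeps such regions available.
>>>     type bumpRegion struct {
>>>         top, limit uintptr // next free byte, end of region
>>>     }
>>>
>>>     func (r *bumpRegion) alloc(size uintptr) (uintptr, bool) {
>>>         if r.top+size > r.limit {
>>>             return 0, false // exhausted: hand off to the GC/slow path
>>>         }
>>>         p := r.top
>>>         r.top += size // the entire fast path
>>>         return p, true
>>>     }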
>>>
>> Concurrent collectors, at least in Hotspot, have a tendency to be
>> outpaced by the mutators because they try to minimize throughput impact
>> (i.e. they don't run all the time, but get triggered when certain
>> thresholds are breached at certain check points).  Then you have a problem
>> of object graph mutation rate (not allocation rate), which in the case of
>> G1 will also increase SATB queue processing and prolong the finalize-marking
>> phase.  Yes, one can tune (or code) around it, and throw lots more headroom
>> at it when all else fails.
>>
>
> You note examples from partially concurrent collectors that had never
> started with a "stay away from STW pauses" approach, and were not designed
> to maintain collector efficiencies independent of mutation and/or
> allocation rates. Future Go collectors do not have to mimic those design
> choices, and neither do other already-successful concurrent collectors
> that produce both consistent low latency and high throughput at the same
> time.
>
> Being outpaced by mutators is a problem only if the algorithms you choose
> are susceptible to such outpacing. Same for mutation rates. Making it
> impossible to be outpaced by the mutator by guaranteeing single pass
> completion and making the collector work invariant to mutation (e.g. by not
> needing to track or react to mutations to complete a GC phase) eliminates
> those problems. They are design choices driven by simple core statements,
> like "don't make collector work per allocation unit vary with allocation
> rate" and "don't make collector work per allocation unit vary with mutation
> rate". Both are doable without impacting efficiency.
>
>
>> So an alternative is you make mutator threads assist in
>> relocations/fixups, which I believe is what C4 does (in a nutshell, I'm
>> glossing over a bunch of things).  But, this adds read barriers.  I know we
>> talked about their purported cost before.  Certainly one can try to
>> minimize the impact, but I'm having a hard time believing that it
>> approaches the same level of impact as a full STW collector.  Of course the
>> impact will vary based on app profile (e.g. reference/pointer heavy vs
>> flatter structures, object graph mutation rate, etc).
>>
>
> There are many possible designs that can meet those needs without some
> inherent sacrifice in efficiency. Especially when that efficiency is often
> regained through some other side-effect.
>
> Yes, the C4 LVB barrier is one such design example. It amounts to adding a
> test+jmp sequence in the reference load path [no memory access,
> virtually-always-predictable, devolves to a single u-op on x86]. While
> there is some "cost" to this extra u-op, its throughput impact is
> insignificant when compared to other things (e.g. a typical 3-4x reduction
> in newgen frequency). It is already significantly cheaper than the extra
> write barrier work done in G1 for remembered set tracking for STW
> compaction, for example.
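>
> A rough sketch of that fast path (my illustrative Go rendering of the
> published LVB description; nmtMask and the slow-path helper are made-up
> names, and the real thing is JIT-emitted test+jmp, not a function call):
>
>     const nmtMask uintptr = 1 << 0 // assumed position of the NMT metadata bit
>
>     // loadRef is the application's reference load plus the barrier. The
>     // check tests metadata bits of the value already loaded into a
>     // register against the expected value for the current GC phase, so
>     // the fast path touches no additional memory.
>     func loadRef(slot *uintptr, expectedNMT uintptr) uintptr {
>         ref := *slot                    // the application load itself
>         if ref&nmtMask != expectedNMT { // test+jmp, virtually always predicted
>             ref = barrierSlowPath(slot, ref) // rare: mark/remap, then heal
>         }
>         return ref
>     }
>
> (barrierSlowPath is sketched further down, with the self-healing barrier.)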
>
> An extreme way to think of a STW collector is as "the mutators are
> actually doing all the GC work". It's just that the mutators have stopped
> doing any mutator-useful work until they are done with all the GC work when
> a collection is triggered. A concurrent collector shifts the bulk of the
> work to background GC threads that may be able to do that work concurrently
> with the mutator threads doing something useful. Loss of efficiency is in
> no way inherent to that shift of work (at least not any more than STW
> parallelism with work stealing queues is less efficient than
> single-threaded operation).
>
> A self-healing barrier (like the one used in C4) is one example of a
> design mechanism that maintains this efficiency. Self-healing barriers
> eliminate the inefficiency sometimes found in concurrent collectors by
> ensuring that barriers only [effectively] trigger at most once per work
> unit. This makes the GC work invariant to heap topology, object graph
> mutations, etc., and to who [mutator or GC thread] is actually doing the
> work. The work per allocation unit then remains fairly fixed. The wish to
> shift the vast-vast-vast majority of that work to collector threads is
> driven purely by mutator latency concerns, but does not affect efficiency
> in any fundamental or significant way.
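>
> Continuing the sketch above, the "self-healing" part is simply the slow
> path writing the corrected reference back into the slot it was loaded
> from (illustrative only; remapAndMark is a hypothetical stand-in for the
> real mark/remap logic):
>
>     import "sync/atomic"
>
>     // barrierSlowPath fixes a reference that failed the fast-path test,
>     // then heals the originating slot so later loads of that slot pass
>     // the fast path: the barrier fires at most once per slot per phase.
>     func barrierSlowPath(slot *uintptr, stale uintptr) uintptr {
>         fixed := remapAndMark(stale) // hypothetical: follow forwarding, set NMT bit
>         atomic.CompareAndSwapUintptr(slot, stale, fixed) // losing a race is harmless
>         return fixed
>     }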
>
>
>>> Given the existence of concurrent collectors that are already able to
>>> sustain higher application throughput than STW in many practical cases
>>> (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument
>>> has practical proof points.
>>>
>>>
>>>>
>>>>> A key architectural choice that has not been completely finalized in
>>>>> Go (AFAICT) has to do with the ability to move objects. It is common to 
>>>>> see
>>>>> early GC-based runtime implementations that do not move objects around
>>>>> succumb to an architectural "hole" that creates a reliance on that fact
>>>>> early in the maturity cycles. This usually happens by allowing native code
>>>>> extensions to assume object addresses are fixed (i.e. that objects will 
>>>>> not
>>>>> move during their lifetime) and getting stuck with that "assumed quality"
>>>>> in the long run due to practical compatibility needs. From what I can 
>>>>> tell,
>>>>> it is not too late for Go to avoid this problem. Go still has precise GC
>>>>> capability (it knows where all the heap references are), and technically
>>>>> can still allow the movement of objects in the heap, even if current GC
>>>>> implementations do not make use of that ability. My hope for Go is that it
>>>>> is not too late to forgo the undesirable architectural limitation of
>>>>> requiring objects to stay at fixed addresses, because much of what I think
>>>>> is ahead for GC in Go may rely on breaking that assumption. [to be clear,
>>>>> supporting various forms of temporary "pinning" of objects is fine, and
>>>>> does not represent a strong architectural limitation.]
>>>>>
>>>>> The ability to move objects is key for two main qualities that future
>>>>> Go GC's (and GCs in pretty much any runtime) will find desirable:
>>>>>
>>>>> 1. The ability to defragment the heap in order to support indefinite
>>>>> execution lengths that do not depend on friendly behavior patterns for
>>>>> object sizes over time. And defragmentation requires moving objects around
>>>>> before they die. Heaps that can compact are dramatically more resilient
>>>>> than ones that can't. Much like GC itself, self-defragmenting heaps allow
>>>>> the relatively easy construction of software that is free to deal with
>>>>> arbitrarily complex data shapes and scenarios. The value of this freedom 
>>>>> is
>>>>> often disputed by people who focus on "things that have worked in C/C++ 
>>>>> for
>>>>> a long time", but is easily seen in the multitude of application logic 
>>>>> that
>>>>> breaks when restricted to using "malloc'ed heap" or "arena allocator"
>>>>> memory management mechanisms.
>>>>>
>>>>> 2. The ability to move objects is fundamental to many of the
>>>>> techniques that support high throughput GC and fast allocation logic.
>>>>> Generational GC is a proven, nearly-universally-used technique that tends
>>>>> to improve GC throughput by at least one order of magnitude [compared to
>>>>> non-generational uses of similar algorithms] across a wide spectrum of
>>>>> application behaviors in many different runtimes. It achieves that by
>>>>> fundamentally focusing some efficient-at-sparse-occupancy algorithm on a
>>>>> known-to-be-sparse logical subset of the heap. And it maintains the
>>>>> qualities needed (tracking the logical subset of the heap, and keeping it
>>>>> sparse) by using object relocation (e.g. for promotion). While there *may*
>>>>> exist alternative techniques that achieve similar throughput improvements
>>>>> without moving objects, I don't know of any that have been widely and
>>>>> successfully used so far, in any runtime.
>>>>>
>>>>> I would predict that as Go's application use cases expand, its GC
>>>>> implementations will evolve to include generational moving collectors.
>>>>> This
>>>>> *does not* necessarily mean a monolithic STW copying newgen. Given their
>>>>> [very correct] focus on latency, I'd expect even the newgen collectors in
>>>>> future generational GC collectors to prioritize the minimization or
>>>>> elimination of STW pauses, which will probably make them either fully
>>>>> concurrent, incremental-STW (with viable arbitrarily small incremental
>>>>> steps), or a combination of the two.
>>>>>
>>>>> And yes, doing that (concurrent generational GC) while supporting
>>>>> great latency, high throughput, and high efficiency all at the same time,
>>>>> is very possible. It is NOT the hard or inherent tradeoff many people seem
>>>>> to believe it to be. And practical proof points already exist.
>>>>>
>>>> If it's not "hard", why do we not see more GCs that have great
>>>> throughput and low latency?
>>>>
>>>
>>> Focus? Working on the right problems?
>>>
>>> It all starts with saying that keeping latency down is a priority, and
>>> that STW pauses (larger than other OS artifacts, at least) are
>>> unacceptable. The rest is good engineering and implementation.
>>>
>> But why focus on latency if you're saying it's not "hard" to have low
>> latency and high throughput?
>>
>
> I used "hard" in "hard tradeoff" above in the sense off "rigid",
> "necessary", or "inherent". Not in the sense "difficult" or "complicated".
>
> As to why focus on latency? I'm not saying focusing only on latency is any
> more right than focusing only on throughput. But ignoring latency in the
> core design choices is something that is hard to correct later (as
> evidenced by 20 years of attempts to deal with this in HotSpot collectors,
> which skipped latency-derived core design drivers).
>
> Focusing on Latency AND Throughput is the best way to go IMO. And I'm
> saying that the two are in no way contradictory or opposite. They can
> sometimes even be complementary. A high throughput concurrent collector is a
> very possible (and very real) thing.
>
> The Go team have started from the latency side, and IMO they will make
> lots of future progress on the throughput side. That's why I predict moving
> generational collection and compacting oldgen in their future. But it's
> also why I predict that those collectors will not be doing monolithic
> stop-the-world moving and compaction. i.e. I think they will end up with
> either concurrent or finely-incremental-STW relocation [of which I think
> concurrent is both more likely and more desirable] in all generations.
>
>
>> Surely if that's the case and you can have your cake and eat it too, it
>> would be done routinely?
>>
>
> We've been routinely eating our cake and having it with C4. And I firmly
> believe there are at least 17 other ways for others to skin that same cat. A
> focused set of people undeterred by "Surely if this was doable it would
> already have been done" talk is all it takes. And especially since it
> actually HAS been done at least once, the question of feasibility is
> already answered.
>
>
>> I think Hotspot initially focused on pure throughput - when the Java
>> threads are running, the only overhead is card marks; otherwise, they own
>> the CPU for the time slice they're there and all (otherwise) eligible CPUs
>> are available for the app to use.
>>
>
> The overall approach with HotSpot can be thought of as "start with a STW
> generational collector and increment your way to lower average/typical
> pauses in the old generations". E.g. all practical collectors in HotSpot
> use a monolithic stop-the-world copying newgen collector (not concurrent or
> incremental, monolithic, run-to-completion STW, always). And all collectors
> in HotSpot have a fallback to a full STW oldgen pause. The various
> non-monolithic-STW "cool modes" (Partially Concurrent non-compacting Oldgen
> in CMS, Incremental STW compaction in G1) are filters in the oldgen that
> occur between full STW newgen pauses and full STW oldgen pauses. They are
> "happy" when the filters work well enough to keep the STW newgen pauses in
> the tens of msec and delay the full-STW pauses for hours or days. And they
> consider it to be "ok and still working" even when those levels are not met.
>
> What HotSpot has never focused on is capping STW pause times, which would
> require the fundamental way of collecting both the oldgen and the newgen to
> not be STW (including any fallback). And since that has never been the
> focus, not a lot of design has been put into achieving concurrency and
> throughput at the same time. E.g. since the vast majority of CPU in HotSpot
> GC tends to be spent in the newgen collectors, and those are always
> monolithic-STW (i.e. they have zero concurrency with the application),
> considerations of concurrency and throughput don't pop up to the top of
> anyone's list of concerns.
>
> But that's just HotSpot. It's not "Java". Just one [very] popular
> implementation of a Java runtime.
>
> Go doesn't have to start from the same point. Instead they seem to be
> starting from a "start with a low-latency GC and increment your way to
> higher throughput". They seem to have taken that [very smart] step early
> [after having a pretty annoying STW early history, which I'd equate to the
> "classic VM" times in the early Java days], and I'm hoping they'll stick
> with that path.
>
>
>>
>>> In that sense, I think the Go team has the right focus and started from
>>> working on the right problems. They have a bunch more engineering and
>>> implementation work ahead to cover wider use cases, but their choices are
>>> already having an effect.
>>>
>>>
>>>
>>>>
>>>>> On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich
>>>>> wrote:
>>>>>>
>>>>>> FWIW, I think the Go team is right in favoring lower latency over
>>>>>> throughput of their GC given the expected usage scenarios for Go.
>>>>>>
>>>>>> In fact, most of the (Hotspot based) Java GC horror stories involve
>>>>>> very long pauses (G1 and CMS not excluded) - I've yet to hear anyone
>>>>>> complain that their "Big Data" servers are too slow because the GC is
>>>>>> chipping away at throughput too much (most of those servers aren't CPU
>>>>>> bound to begin with or can't become CPU bound due to bottlenecks 
>>>>>> elsewhere).
>>>>>>
>>>>>> In the Hotspot world, there's also the unfortunate situation right
>>>>>> now where G1 isn't really solving any problem nor advancing the 
>>>>>> performance
>>>>>> (latency) boundaries.  In fact, it taxes throughput quite heavily (e.g.
>>>>>> significantly more expensive write barriers) but isn't breaking any new
>>>>>> ground.
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:
>>>>>>
>>>>>>> You might be interested in this Golang-group follow-up thread too…
>>>>>>> https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4
>>>>>>> (naturally a more pro-Go take on things…)
>>>>>>>
>>>>>>> On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thought people on here would enjoy this
>>>>>>>>
>>>>>>>>
>>>>>>>> https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2
>>>>>>>>
>>>>>>>> --
>>>>>>>> Studying for the Turing test
