On Thursday, December 22, 2016 at 10:46:47 AM UTC-8, Vitaly Davidovich 
wrote:
>
>
>
> On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> 
> wrote:
>
>>
>>
>> On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich 
>> wrote:
>>>
>>>
>>>
>>> On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
>>>
>>>> Go's GC story is evolving. And it's doing so at a fairly quick pace. 
>>>> Judging it based on a current implementation and current choices is 
>>>> premature, I think.
>>>>
>>> Doesn't seem premature at all to me.  Current implementation is what 
>>> runs in production, what people will experience, and what they should 
>>> expect/write their code towards.  This is not too dissimilar from the 
>>> Sufficiently Smart Compiler - it *could* do something in *theory*, but 
>>> practical/real implementations don't.  That's not to say their story won't 
>>> improve over time, but judging current impl is pretty fair IMO. 
>>>
>>
>> Looking at the current state of implementation to judge current use is 
>> fine. But projecting from the current state to Go as a whole and assuming 
>> inherent qualities (something that I see a lot of) would be similar to 
>> concluding that "Java is slow because it is an interpreted language" and 
>> basing those conclusions on 1997 JVMs.
>>
>  Sure. 
>
>>  
>>
>>>> I think that the Go team's *implementation priority* focus on latency 
>>>> is a good one. And unlike others, I don't actually think there is a 
>>>> strong latency vs. throughput tradeoff in the long run. Achieving great 
>>>> latency in GC does not inherently come at the expense of GC throughput 
>>>> or application throughput. Both can be had at the same time. My view is 
>>>> that GC implementations in general (for all runtimes) will move towards 
>>>> good latency + good throughput as they mature. And Go is no different 
>>>> here.
>>>>
>>> I don't see how minimizing latency cannot impact application 
>>> throughput.  The impact can be minimized, but I don't see how it can be 
>>> (nearly) zero.  The impact will vary from app to app, depending on how much 
>>> it intrinsically can make use of CPU before bottlenecking elsewhere and 
>>> giving up CPU for reasons other than GC. 
>>>
>>
>> There are tiny tradeoffs involved, but they are "in the noise", and are 
>> usually immeasurable in practice. So yes, I'm saying that it is "nearly 
>> zero". Concurrency and latency are not inherently coupled with efficiency 
>> choices in GC work. In practice, they tend to have very little to do with 
>> each other.
>>
> They're necessarily coupled IMO.  If a CPU is occupied with GC work that 
> could otherwise be used by the user app, it's a throughput hit.  
>

You may be confusing what you know about the details of some "somewhat 
concurrent, but not designed for simultaneous concurrency and efficiency" 
implementations with what is inherent to performing concurrent GC work.

The bulk (as in the vast, vast, vast majority) of the overall rate of CPU 
consumed by GC work is an inherent result of allocation rate (which is a 
unit of throughput). How much of that work you do per allocation unit is 
NOT strongly affected by concurrency and interleaving choices in GC 
algorithms. I listed examples of dramatically more dominant algorithmic 
choices that have nothing to do with concurrency or incremental ability.
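
To make the shape of that concrete, a back-of-envelope sketch (all numbers 
are assumed, for illustration only): GC CPU is roughly allocation rate 
times work per allocated byte, and concurrency choices barely touch the 
second term.

    package main

    import "fmt"

    func main() {
        // Assumed, illustrative figures -- not measurements.
        allocRate := 2.0e9  // bytes/sec allocated by the application
        workPerByte := 2e-9 // seconds of GC work per allocated byte
        cores := 16.0

        gcCores := allocRate * workPerByte // GC-seconds per wall-clock second
        fmt.Printf("GC consumes ~%.1f of %.0f cores (%.0f%%),\n",
            gcCores, cores, 100*gcCores/cores)
        fmt.Println("whether that work runs inside STW pauses or concurrently.")
    }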
 

> If extra instructions are executed on behalf of GC while also performing 
> user work, it's a loss in throughput (e.g. write barriers, indirection 
> barriers that may hijack execution, etc).
>

Again, most of that "extra instruction" cost is driven not by concurrency 
or incremental ability, but by other large, throughput-affecting choices. 
E.g. the need for a write barrier is usually dictated by generational 
collection (STW or not doesn't matter), not by concurrency. And adding a 
write barrier (to achieve generational collection) is a proven way to 
dramatically increase GC efficiency.
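
For concreteness, here is a much-simplified card-marking write barrier 
sketch (constants and layout are illustrative assumptions, not any 
particular runtime's). Note that nothing in it depends on whether the 
collector is STW or concurrent; the cost exists because the collector is 
generational.

    package gcsketch

    import "unsafe"

    const cardShift = 9 // one card covers 512 bytes of heap (assumed)

    var (
        heapBase  uintptr // start of the managed heap
        cardTable []byte  // one dirty byte per card
    )

    // writeRef stores a reference into a field of holder and dirties the
    // card covering holder, so the next newgen collection can find
    // old-to-new pointers by scanning dirty cards instead of the whole heap.
    func writeRef(slot *unsafe.Pointer, holder, val unsafe.Pointer) {
        *slot = val
        cardTable[(uintptr(holder)-heapBase)>>cardShift] = 1
    }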
 

>   I'm not even talking about the possible d-cache and i-cache pollution, 
> cache misses, registers reserved to support GC, etc that may be involved.
>

The impact of these is so tiny that it is hard to measure. Certainly a few 
orders of magnitude smaller than e.g. generational collection yes/no 
decisions.

> Since GC cannot know the allocation/memory pattern perfectly, it'll 
> ultimately perform more CPU work than could otherwise be necessary.  This 
> is the price of admission for this type of automatic management.
>

When a concurrent GC's triggering and throttling mechanisms successfully 
get to 98%+ occupancy on a regular basis, this perceived cost fades away. And 
it is easy and common to see that level of full use of the heap resources 
[yes, "70% occupancy trigger" is a silly thing, but it is in no way 
inherent or needed, it's just an artifact of some implementations]. Even 
where things are more conservative (GC kicks off earlier than it should), 
the impacts of the extra time spent tend to be significantly lower than 
other choices, and are often offset by the easier choices that concurrency 
allows.
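
For concreteness, one simplified (assumed) shape of such an adaptive 
trigger: rather than a fixed occupancy threshold, start a cycle when the 
remaining free heap barely covers the allocation expected while the cycle 
runs.

    package gcsketch

    // shouldStartGC sketches an adaptive concurrent-GC trigger. Instead of
    // a fixed "70% occupancy" rule, it starts a cycle only when projected
    // allocation during the cycle (padded by a safety factor) would consume
    // the heap that is still free.
    func shouldStartGC(freeBytes, allocRatePerSec, gcDurationSec, safety float64) bool {
        projected := allocRatePerSec * gcDurationSec * safety
        return freeBytes <= projected
    }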

To use a practical example: a concurrent generational collector can easily 
dispense with the artificial line between oldgen and newgen heaps that many 
STW collectors employ, because when neither generation incurs significant 
pauses, the original reason for having the line there is gone (when oldgen 
STW was significant, exhausting the oldgen early was unacceptable). This 
allows the newgen to always use the entire available heap, reducing newgen 
collection frequency by a significant factor, which eliminates a 
significant amount of GC work (with the same application throughput in the 
same heap size). This results in LESS work for GC, and higher throughput. 
This increased efficiency is more an artifact of GC algorithm choice (share 
the heap between old and newgen rather than segregate it) than a result of 
concurrency. Concurrency simply enables that choice to be made much more 
easily.  
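
A rough worked example (numbers assumed for illustration): newgen 
collection frequency is allocation rate divided by the space the newgen 
may fill before a collection, so sharing the whole free heap instead of a 
fixed partition cuts frequency, and total newgen GC work, by the same 
factor.

    package main

    import "fmt"

    func main() {
        allocRate := 1.0   // GB/sec allocated (assumed)
        fixedNewgen := 1.0 // GB: a static newgen partition in a 4 GB heap
        sharedFree := 3.5  // GB: all heap not held by live data

        fmt.Printf("fixed partition: %.2f newgen GCs/sec\n", allocRate/fixedNewgen)
        fmt.Printf("shared heap:     %.2f newgen GCs/sec (%.1fx less GC work)\n",
            allocRate/sharedFree, sharedFree/fixedNewgen)
    }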

> There's nothing inherently wrong with it if that cost is acceptable.  
> Likewise, there's a development/mental cost to manual memory management as 
> well.  It's all about priorities and cost.
>

A virtually-zero additional cost in CPU is acceptable by definition.


>> The fundamentals are simple:
>>
>> For a given live set, heap set, and rate of allocation, a GC algorithm 
>> has a certain amount of work to do. It needs to locate the dead material 
>> and reclaim it such that it can be allocated into again. The choice of 
>> algorithms used affects the amount of work needed to achieve this. But this 
>> amount of work tends to be only minutely affected by concurrency concerns, 
>> and is dramatically affected by other choices. E.g. the choice to use 
>> compacting collectors dramatically affects the cost of allocation (bump 
>> pointer allocation is dramatically cheaper than allocation in 
>> non-contiguously empty regions). E.g. the choice of using a generational 
>> filter (or not) has several orders of magnitude higher impact on GC 
>> throughput than the choice to do concurrent or tiny-incremental-STW work 
>> (or not). The choice to use evacuation as a compaction technique (common in 
>> copying and regional collectors, but not in classic MarkSweepCompact) can 
>> dramatically affect efficiency (in two opposite directions) and depends a 
>> lot more on the amount of empty heap (aka "heap headroom") available than 
>> on concurrency or incremental STW design choices.
>>
> Concurrent collectors, at least in Hotspot, have a tendency to be outpaced 
> by the mutators because they try to minimize throughput impact (i.e. they 
> don't run all the time, but get triggered when certain thresholds are 
> breached at certain check points).  Then you have a problem of object graph 
> mutation rate (not allocation rate), which in the case of G1 will also 
> increase SATB queue processing and finalize marking phases.  Yes, one can 
> tune (or code) around it, and throw lots more headroom at it when all else 
> fails.
>

You note examples from partially concurrent collectors that had never 
started with a "stay away from STW pauses" approach, and were not designed 
to maintain collector efficiencies independent of mutation and/or 
allocation rates. Future Go collectors do not have to mimic those design 
choices, and neither do other already-successful concurrent collectors 
that produce both consistent low latency and high throughput at the same 
time.

Being outpaced by mutators is a problem only if the algorithms you choose 
are susceptible to such outpacing. Same for mutation rates. Making it 
impossible to be outpaced by the mutator by guaranteeing single pass 
completion and making the collector work invariant to mutation (e.g. by not 
needing to track or react to mutations to complete a GC phase) eliminates 
those problems. They are design choices driven by simple core statements, 
like "don't make collector work per allocation unit vary with allocation 
rate" and "don't make collector work per allocation unit vary with mutation 
rate". Both are doable without impacting efficiency.
 

> So an alternative is you make mutator threads assist in 
> relocations/fixups, which I believe is what C4 does (in a nutshell, I'm 
> glossing over a bunch of things).  But, this adds read barriers.  I know we 
> talked about their purported cost before.  Certainly one can try to 
> minimize the impact, but I'm having a hard time believing that it 
> approaches the same level of impact as a full STW collector.  Of course the 
> impact will vary based on app profile (e.g. reference/pointer heavy vs 
> flatter structures, object graph mutation rate, etc).
>

There are many possible designs that can meet those needs without some 
inherent sacrifice in efficiency, especially since that efficiency is often 
regained through some other side-effect.

Yes, the C4 LVB barrier is one such design example. It amounts to adding a 
test+jmp sequence in the reference load path [no memory access, 
virtually-always-predictable, devolves to a single u-op on x86]. While 
there is some "cost" to this extra u-op, its throughput impact is 
insignificant when compared to other things (e.g. a typical 3-4x reduction 
in newgen frequency). It is already significantly cheaper than the extra 
write barrier work done in G1 for remembered set tracking for STW 
compaction, for example.

An extreme way to think of a STW collector is as "the mutators are actually 
doing all the GC work". It's just that when a collection is triggered, the 
mutators stop doing any mutator-useful work until all the GC work is done. 
A concurrent collector shifts the bulk of the work to background GC threads 
that may be able to do that work concurrently with the mutator threads 
doing something useful. Loss of efficiency is in no way inherent to that 
shift of work (at least not any more than STW parallelism with work 
stealing queues is less efficient than single-threaded operation).
  
A self-healing barrier (like the one used in C4) is one example of a design 
mechanism that maintains this efficiency. Self-healing barriers eliminate 
the inefficiency sometimes found in concurrent collectors by ensuring that 
barriers only [effectively] trigger at most once per work unit. This makes 
the GC work invariant to heap topology, object graph mutations, etc., and 
to who [mutator or GC thread] is actually doing the work. The work per 
allocation unit then remains fairly fixed. The wish to shift the 
vast-vast-vast majority of that work to collector threads is driven purely 
by mutator latency concerns, but does not affect efficiency in any 
fundamental or significant way. 
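
For flavor, a much-simplified sketch of what a self-healing loaded-value 
barrier can look like (names, tag bits, and slow-path details are 
illustrative assumptions in the spirit of C4's LVB, not Azul's actual 
code):

    package gcsketch

    import "unsafe"

    const colorMask uintptr = 0x3 // assumed GC-phase tag bits in each reference

    var expectedColor uintptr // the tag all references must carry this phase

    // loadRef is the fast path: one test and a virtually-always-predicted
    // branch on every reference load.
    func loadRef(slot *unsafe.Pointer) unsafe.Pointer {
        ref := *slot
        if uintptr(ref)&colorMask == expectedColor {
            return ref // common case: nothing to do
        }
        return heal(slot, ref)
    }

    // heal is the slow path. It fixes the *cause* of the trigger, then
    // writes the corrected reference back into the slot (a real runtime
    // would CAS here). That write-back is the "self-healing" part: no
    // thread takes the slow path for this slot again in this phase, so the
    // barrier fires at most once per reference.
    func heal(slot *unsafe.Pointer, ref unsafe.Pointer) unsafe.Pointer {
        good := markOrRelocate(ref)
        *slot = good
        return good
    }

    // markOrRelocate stands in for the phase-specific GC work: a real
    // collector would mark the target, or chase a forwarding pointer if the
    // object was relocated, and re-tag the reference's color bits.
    func markOrRelocate(ref unsafe.Pointer) unsafe.Pointer {
        return unsafe.Pointer(uintptr(ref)&^colorMask | expectedColor)
    }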
 

> Given the existence of concurrent collectors that are already able to 
>> sustain higher application throughput than STW in many practical cases 
>> (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument 
>> has practical proof points. 
>>  
>>
>>>
>>>> A key architectural choice that has not been completely finalized in Go 
>>>> (AFAICT) has to do with the ability to move objects. It is common to see 
>>>> early GC-based runtime implementations that do not move objects around 
>>>> succumb to an architectural "hole" that creates a reliance on that fact 
>>>> early in the maturity cycles. This usually happens by allowing native code 
>>>> extensions to assume object addresses are fixed (i.e. that objects will 
>>>> not 
>>>> move during their lifetime) and getting stuck with that "assumed quality" 
>>>> in the long run due to practical compatibility needs. From what I can 
>>>> tell, 
>>>> it is not too late for Go to avoid this problem. Go still has precise GC 
>>>> capability (it knows where all the heap references are), and technically 
>>>> can still allow the movement of objects in the heap, even if current GC 
>>>> implementations do not make use of that ability. My hope for Go is that it 
>>>> is not too late to forgo the undesirable architectural limitation of 
>>>> requiring objects to stay at fixed addresses, because much of what I think 
>>>> is ahead for GC in Go may rely on breaking that assumption. [to be clear, 
>>>> supporting various forms of temporary "pinning" of objects is fine, and 
>>>> does not represent a strong architectural limitation.]
>>>>
>>>> The ability to move objects is key for two main qualities that future 
>>>> Go GC's (and GCs in pretty much any runtime) will find desirable:
>>>>
>>>> 1. The ability to defragment the heap in order to support indefinite 
>>>> execution lengths that do not depend on friendly behavior patterns for 
>>>> object sizes over time. And defragmentation requires moving objects around 
>>>> before they die. Heaps that can compact are dramatically more resilient 
>>>> than ones that cannot. Much like GC itself, self-defragmenting heaps allow 
>>>> the relatively easy construction of software that is free to deal with 
>>>> arbitrarily complex data shapes and scenarios. The value of this freedom 
>>>> is 
>>>> often disputed by people who focus on "things that have worked in C/C++ 
>>>> for 
>>>> a long time", but is easily seen in the multitude of application logic 
>>>> that 
>>>> breaks when restricted to using "malloc'ed heap" or "arena allocator" 
>>>> memory management mechanisms.
>>>>
>>>> 2. The ability to move objects is fundamental to many of the techniques 
>>>> that support high throughput GC and fast allocation logic. Generational GC 
>>>> is a proven, nearly-universally-used technique that tends to improve GC 
>>>> throughput by at least one order of magnitude [compared to 
>>>> non-generational 
>>>> uses of similar algorithms] across a wide spectrum of application 
>>>> behaviors 
>>>> in many different runtimes. It achieves that by fundamentally focusing 
>>>> some 
>>>> efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical 
>>>> subset of the heap. And it maintains the qualities needed (tracking the 
>>>> logical subset of the heap, and keeping it sparse) by using object 
>>>> relocation (e.g. for promotion). While there *may* exist alternative 
>>>> techniques that achieve similar throughput improvements without moving 
>>>> objects, I don't know of any that have been widely and successfully used 
>>>> so 
>>>> far, in any runtime.
>>>>
>>>> I would predict that as Go's application use cases expand, its GC 
>>>> implementations will evolve to include generational moving collectors. 
>>>> This 
>>>> *does not* necessarily mean a monolithic STW copying newgen. Given their 
>>>> [very correct] focus on latency, I'd expect even the newgen collectors in 
>>>> future generational GC collectors to prioritize the minimization or 
>>>> elimination of STW pauses, which will probably make them either fully 
>>>> concurrent, incremental-STW (with viable arbitrarily small incremental 
>>>> steps), or a combination of the two.
>>>>
>>>> And yes, doing that (concurrent generational GC) while supporting great 
>>>> latency, high throughput, and high efficiency all at the same time, is 
>>>> very 
>>>> possible. It is NOT the hard or inherent tradeoff many people seem to 
>>>> believe it to be. And practical proof points already exist.
>>>>
>>> If it's not "hard", why do we not see more GCs that have great 
>>> throughput and low latency? 
>>>
>>
>> Focus? Working on the right problems?
>>
>> It all starts with saying that keeping latency down is a priority, and 
>> that STW pauses (larger than other OS artifacts, at least) are 
>> unacceptable. The rest is good engineering and implementation.
>>
> But why focus on latency if you're saying it's not "hard" to have low 
> latency and high throughput? 
>

I used "hard" in "hard tradeoff" above in the sense off "rigid", 
"necessary", or "inherent". Not in the sense "difficult" or "complicated". 

As to why focus on latency? I'm not saying focusing only on latency is any 
more right than focusing only on throughput. But ignoring latency in the 
core design choices is something that is hard to correct later (as 
evidenced by 20 years of attempts to deal with this in HotSpot collectors, 
which skipped latency-derived core design drivers).

Focusing on Latency AND Throughput is the best way to go IMO. And I'm 
saying that the two are in no way contradictory or opposite. They can 
sometimes even be complementary. A high throughput concurrent collector is 
a very possible (and very real) thing.

The Go team have started from the latency side, and IMO they will make lots 
of future progress on the throughput side. That's why I predict moving 
generational collection and compacting oldgen in their future. But it's 
also why I predict that those collectors will not be doing monolithic 
stop-the-world moving and compaction. i.e. I think they will end up with 
either concurrent or finely-incremental-STW relocation [of which I think 
concurrent is both more likely and more desirable] in all generations.
 

> Surely if that's the case and you can have your cake and eat it too, it 
> would be done routinely?
>

We've been routinely eating our cake and having it with C4. And I firmly 
believe there are at least 17 other ways for others to skin that same cat. A 
focused set of people undeterred by "Surely if this was doable it would 
already have been done" talk is all it takes. And especially since it 
actually HAS been done at least once, the question of feasibility is 
already answered.
 

> I think Hotspot initially focused on pure throughput - when the java 
> threads are running, the only overhead is card marks; otherwise, they own 
> the CPU for the time slice they're there and all (otherwise) eligible CPUs 
> are available for the app to use.
>

The overall approach with HotSpot can be thought of as "start with a STW 
generational collector and increment your way to lower average/typical 
pauses in the old generations". E.g. all practical collectors in HotSpot 
use a monolithic stop-the-world copying newgen collector (not concurrent or 
incremental, monolithic, run-to-completion STW, always). And all collectors 
in HotSpot have a fallback to a full STW oldgen pause. The various 
non-monolithic-STW "cool modes" (Partially Concurrent non-compacting Oldgen 
in CMS, Incremental STW compaction in G1) are filters in the oldgen that 
occur between full STW newgen pauses and full STW oldgen pauses. They are 
"happy" when the filters work well enough to keep the STW newgen pauses in 
the tens of msec and delay the full-STW pauses for hours or days. And they 
consider it to be "ok and still working" even when those levels are not met.

What HotSpot has never focused on is capping STW pause times, which would 
require the fundamental way of collecting both the oldgen and the newgen to 
not be STW (including any fallback). And since that has never been the 
focus, not a lot of design has been put into achieving concurrency and 
throughput at the same time. E.g. since the vast amount of CPU in HotSpot 
GC tends to be spent in the newgen collectors, and those are always 
monolithic-STW (i.e. they have zero concurrency with the application), 
considerations of concurrency and throughput don't pop up to the top of 
anyone's list of concerns.

But that's just HotSpot. It's not "Java". Just one [very] popular 
implementation of a Java runtime.

Go doesn't have to start from the same point. Instead they seem to be 
starting from a "start with a low-latency GC and increment your way to 
higher throughput". They seem to have taken that [very smart] step early 
[after having a pretty annoying STW early history, which I'd equate to the 
"classic VM" times in the early Java days], and I'm hoping they'll stick 
with that path.
 

>
>> In that sense, I think the Go team has the right focus and started from 
>> working on the right problems. They have a bunch more engineering and 
>> implementation work ahead to cover wider use cases, but their choices are 
>> already having an effect.
>>
>>  
>>
>>>
>>>> On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich 
>>>> wrote:
>>>>>
>>>>> FWIW, I think the Go team is right in favoring lower latency over 
>>>>> throughput of their GC given the expected usage scenarios for Go.
>>>>>
>>>>> In fact, most of the (Hotspot based) Java GC horror stories involve 
>>>>> very long pauses (G1 and CMS not excluded) - I've yet to hear anyone 
>>>>> complain that their "Big Data" servers are too slow because the GC is 
>>>>> chipping away at throughput too much (most of those servers aren't CPU 
>>>>> bound to begin with or can't become CPU bound due to bottlenecks 
>>>>> elsewhere).
>>>>>
>>>>> In the Hotspot world, there's also the unfortunate situation right now 
>>>>> where G1 isn't really solving any problem nor advancing the performance 
>>>>> (latency) boundaries.  In fact, it taxes throughput quite heavily (e.g. 
>>>>> significantly more expensive write barriers) but isn't breaking any new 
>>>>> ground.
>>>>>
>>>>>
>>>>> On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:
>>>>>
>>>>>> You might be interested in this Golang-group follow-up thread too… 
>>>>>> https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 
>>>>>> (naturally a more pro-Go
>>>>>> take on things…)
>>>>>>
>>>>>> On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
>>>>>>>
>>>>>>> Thought people on here would enjoy this
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2