On Thursday, December 22, 2016 at 2:18:30 PM UTC-8, Vitaly Davidovich wrote:

> So I agree that Go focusing on low latency is the way to go. I'm not sold that they can make it so low cost that building, for example, a high performance database in Go would have comparable latency and throughput to one built in a non-GC'd language/runtime. All the "little/insignificant" inefficiencies you brush away as being virtually zero/undetectable/etc add up. Heck, we've talked about the costs of context (or mode) switches on this thread before, and how they can be dominated by the caches needing to be warmed up (vs register/context being saved/restored). But you deem that type of cost negligible. I'll agree that it's negligible compared to 100s of millis (or worse - multiple secs or tens of secs) GC pauses. It's definitely not negligible for the lower level systems being built.

> I'm curious about this bit:

> > Yes, the C4 LVB barrier is one such design example. It amounts to adding a test+jmp sequence in the reference load path [no memory access, virtually-always-predictable, devolves to a single u-op on x86]. While there is some "cost" to this extra u-op, its throughput impact is insignificant when compared to other things (e.g. a typical 3-4x reduction in newgen frequency). It is already significantly cheaper than the extra write barrier work done in G1 for remembered set tracking for STW compaction, for example.

> So there's a test+jmp inserted before every load, which is pretty standard and common fare in Java code. But how is it "no memory access"? IIRC, when we last discussed the intricacies, an LVB would check whether a phase change occurred, which I think is a memory read, isn't it? Or by "no memory access" you mean it should still be valid in the cache? Also, I'm assuming once inlining stops at some point, the LVB/phase change information needs to be regathered by callees that were not inlined? So in other words, this will devolve into another prologue (beyond call frame setup)? I can't see how this won't impact i-cache or pollute the BTB history.
The test+jmp happens after a reference load, not before a load (or a store) that uses a reference [in this sense it is very different from, and less frequent than, "use barrier" variants of read barriers, such as Brooks-style read barriers]. The test is between the just-loaded reference and a very stable value (one that changes only when phase changes occur). That value can be read from memory (from a thread-local location, for example), but can be (and is) cached in a register. And that caching can even be done across frames, such that only leaving and entering Java (e.g. making JNI calls) needs to re-establish it. The choice of keeping the value in a register or "spilling/filling" it from memory can be left to the register allocator (it's just like anything else held in registers, except that "spilling" can be made free). There is certainly an I-cache footprint here, and also an increased density of branches (when compared to not having the test+jmp at all). It's just that those effects are tiny in the greater scheme of things. For some perspective: the I-cache and branch density impacts of LVBs in current implementations are much smaller than the extra I-cache and branch density impacts of the current G1 write barrier implementations (compared to the cheaper non-G1 write barriers). A minimal sketch of this fast-path/slow-path shape follows below.

> > > On Thu, Dec 22, 2016 at 3:58 PM, Gil Tene <g...@azul.com> wrote:
> > > > On Thursday, December 22, 2016 at 10:46:47 AM UTC-8, Vitaly Davidovich wrote:
> > > > > On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> wrote:
> > > > > > On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich wrote:
> > > > > > > On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:

> > Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.

> Doesn't seem premature at all to me. Current implementation is what runs in production, what people will experience, and what they should expect/write their code towards. This is not too dissimilar from the Sufficiently Smart Compiler - it *could* do something in *theory*, but practical/real implementations don't. That's not to say their story won't improve over time, but judging the current impl is pretty fair IMO.

> > Looking at the current state of the implementation to judge current use is fine. But projecting from the current state to Go as a whole and assuming inherent qualities (something that I see a lot of) would be similar to concluding that "Java is slow because it is an interpreted language" and basing those conclusions on 1997 JVMs.

> Sure.

> > I think that the Go team's *implementation priority* focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run. Achieving great latency in GC does not inherently come at the expense of GC throughput or application throughput. Both can be had at the same time. My view is that GC implementations in general (for all runtimes) will move towards good latency + good throughput at the same time as they mature. And Go is no different here.

> I don't see how minimizing latency cannot impact application throughput. The impact can be minimized, but I don't see how it can be (nearly) zero. The impact will vary from app to app, depending on how much it intrinsically can make use of CPU before bottlenecking elsewhere and giving up CPU for reasons other than GC.
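To make the fast-path/slow-path shape described above concrete, here is a minimal, hypothetical sketch in Go. This is a sketch under stated assumptions, not Azul's implementation: the phase-bit layout, the names, and the expectedPhase variable are invented for illustration (in the real thing the JIT emits the load, test, and jmp inline, with the compare value held in a register).

    package main

    import "fmt"

    // ref models a tagged reference: a few high bits carry GC-phase
    // metadata. This layout is an assumption for the example only.
    type ref uint64

    const phaseMask ref = 0xF << 60

    // expectedPhase stands in for the stable compare value described
    // above: it changes only at GC phase changes, and a JIT would keep
    // it cached in a register (re-established on JNI exit/entry).
    var expectedPhase ref = 2 << 60

    // loadRef is the barriered reference load: one load plus one
    // test+jmp. The branch is virtually never taken, so it is
    // virtually always predicted correctly.
    func loadRef(slot *ref) ref {
        r := *slot
        if r&phaseMask != expectedPhase { // the test+jmp
            r = slowPath(slot, r)
        }
        return r
    }

    // slowPath stands in for the remap/mark work, and then "self-heals"
    // the source slot so later loads of the same slot take the fast
    // path: the barrier effectively triggers at most once per slot per
    // phase, which is what keeps GC work invariant to heap topology.
    func slowPath(slot *ref, r ref) ref {
        healed := (r &^ phaseMask) | expectedPhase // placeholder for remap+mark
        *slot = healed
        return healed
    }

    func main() {
        x := ref(0x1000) | 1<<60         // a reference stamped with a stale phase
        fmt.Printf("%#x\n", loadRef(&x)) // slow path fires once and heals x
        fmt.Printf("%#x\n", loadRef(&x)) // fast path only from here on
    }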
> > There are tiny tradeoffs involved, but they are "in the noise", and are usually immeasurable in practice. So yes, I'm saying that it is "nearly zero". Concurrency and latency are not inherently coupled with efficiency choices in GC work. In practice, they tend to have very little to do with each other.

> They're necessarily coupled IMO. If a CPU is occupied with GC work that could otherwise be used by the user app, it's a throughput hit.

> > You may be confusing what you know about the details of some "somewhat concurrent, but not designed for simultaneous concurrency and efficiency" implementations with what is inherent to performing concurrent GC work.

> > The bulk (as in the vast vast vast majority) of the overall rate of CPU consumed by GC work is an inherent result of allocation rate (which is a unit of throughput). How much of that work you do per allocation unit is NOT strongly affected by concurrency and interleaving choices in GC algorithms. I listed examples of dramatically more dominant algorithmic choices that have nothing to do with concurrency or incremental ability.

> If extra instructions are executed on behalf of GC while also performing user work, it's a loss in throughput (e.g. write barriers, indirection barriers that may hijack execution, etc).

> > Again, most of that "extra instruction" cost is driven not by concurrency or incremental ability, but by other large-throughput-affecting choices. E.g. the need for a write barrier is usually dictated by generational collection (STW or not doesn't matter), not by concurrency. And adding a write barrier (to achieve generational collection) is a proven way to dramatically increase GC efficiency.

> I'm not even talking about the possible d-cache and i-cache pollution, cache misses, registers reserved to support GC, etc that may be involved.

> > The impacts of these are so tiny that they are hard to measure. Certainly a few orders of magnitude smaller than e.g. generational collection yes/no decisions.

> Since GC cannot know the allocation/memory pattern perfectly, it'll ultimately perform more CPU work than could otherwise be necessary. This is the price of admission for this type of automatic management.

> > When a concurrent GC's triggering and throttling mechanisms successfully get to 98%+ occupancy on a regular basis, this perceived cost fades away. And it is easy and common to see that level of full use of the heap resources [yes, a "70% occupancy trigger" is a silly thing, but it is in no way inherent or needed; it's just an artifact of some implementations]. Even where things are more conservative (GC kicks off earlier than it should), the impacts of the extra time spent tend to be significantly lower than other choices, and are often offset by the easier choices that concurrency allows.

> > To use a practical example: a concurrent generational collector can easily dispense with the artificial line between oldgen and newgen heaps that many STW collectors employ, because when neither generation incurs significant pauses, the original reason for having the line there is gone (when oldgen STW was significant, exhausting the oldgen early was unacceptable). This allows the newgen to always use the entire available heap, reducing newgen collection frequency by a significant factor, which eliminates a significant amount of GC work (with the same application throughput in the same heap size).
> > This results in LESS work for GC, and higher throughput. This increased efficiency is more an artifact of GC algorithm choice (share the heap between oldgen and newgen rather than segregate it) than a result of concurrency. Concurrency simply enables that choice to be made much more easily.

> There's nothing inherently wrong with it if that cost is acceptable. Likewise, there's a development/mental cost to manual memory management as well. It's all about priorities and cost.

> > A virtually-zero additional cost in CPU is acceptable by definition.

> > The fundamentals are simple:

> > For a given live set, heap size, and rate of allocation, a GC algorithm has a certain amount of work to do. It needs to locate the dead material and reclaim it such that it can be allocated into again. The choice of algorithms used affects the amount of work needed to achieve this. But this amount of work tends to be only minutely affected by concurrency concerns, and is dramatically affected by other choices. E.g. the choice to use compacting collectors dramatically affects the cost of allocation (bump pointer allocation is dramatically cheaper than allocation in non-contiguously empty regions; a short sketch of this follows below). E.g. the choice of using a generational filter (or not) has several orders of magnitude higher impact on GC throughput than the choice to do concurrent or tiny-incremental-STW work (or not). The choice to use evacuation as a compaction technique (common in copying and regional collectors, but not in classic MarkSweepCompact) can dramatically affect efficiency (in two opposite directions) and depends a lot more on the amount of empty heap (aka "heap headroom") available than on concurrency or incremental-STW design choices.

> Concurrent collectors, at least in Hotspot, have a tendency to be outpaced by the mutators because they try to minimize throughput impact (i.e. they don't run all the time, but get triggered when certain thresholds are breached at certain check points). Then you have a problem of object graph mutation rate (not allocation rate), which in the case of G1 will also increase SATB queue processing and final marking phases. Yes, one can tune (or code) around it, and throw lots more headroom at it when all else fails.

> > You note examples from partially concurrent collectors that never started with a "stay away from STW pauses" approach, and were not designed to maintain collector efficiencies independent of mutation and/or allocation rates. Future Go collectors do not have to mimic those design choices, and neither do other already-successful concurrent collectors that produce both consistent low latency and high throughput at the same time.

> > Being outpaced by mutators is a problem only if the algorithms you choose are susceptible to such outpacing. Same for mutation rates. Making it impossible to be outpaced by the mutator, by guaranteeing single-pass completion and by making the collector work invariant to mutation (e.g. by not needing to track or react to mutations to complete a GC phase), eliminates those problems. These are design choices driven by simple core statements, like "don't make collector work per allocation unit vary with allocation rate" and "don't make collector work per allocation unit vary with mutation rate". Both are doable without impacting efficiency.
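As a rough illustration of the allocation-cost point above, here is a hedged Go sketch of bump-pointer allocation into a contiguously empty region (the kind of region a compacting or evacuating collector produces). The names and sizes are invented for the example; real allocators add per-thread allocation buffers, alignment, and zeroing.

    package main

    import "fmt"

    // bumpRegion is a toy contiguously-empty allocation region, like
    // the regions a compacting collector hands out after compaction.
    type bumpRegion struct {
        top, limit uintptr // next free byte, end of region
    }

    // alloc is the entire fast path: a bounds check plus one addition.
    // Allocating from non-contiguously-empty memory instead requires
    // free-list searches and size-class bookkeeping on every allocation.
    func (r *bumpRegion) alloc(size uintptr) (uintptr, bool) {
        if r.top+size > r.limit {
            return 0, false // region exhausted: refill or trigger GC
        }
        p := r.top
        r.top += size // "finding free memory" is a single bump
        return p, true
    }

    func main() {
        r := &bumpRegion{top: 0x10000, limit: 0x20000}
        a, _ := r.alloc(64)
        b, _ := r.alloc(128)
        fmt.Printf("%#x %#x\n", a, b) // 0x10000 0x10040: adjacent and cache-friendly
    }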
> So an alternative is you make mutator threads assist in relocations/fixups, which I believe is what C4 does (in a nutshell, I'm glossing over a bunch of things). But this adds read barriers. I know we talked about their purported cost before. Certainly one can try to minimize the impact, but I'm having a hard time believing that it approaches the same level of impact as a full STW collector. Of course the impact will vary based on app profile (e.g. reference/pointer heavy vs flatter structures, object graph mutation rate, etc).

> > There are many possible designs that can meet those needs without some inherent sacrifice in efficiency. Especially when that efficiency is often regained through some other side-effect.

> > Yes, the C4 LVB barrier is one such design example. It amounts to adding a test+jmp sequence in the reference load path [no memory access, virtually-always-predictable, devolves to a single u-op on x86]. While there is some "cost" to this extra u-op, its throughput impact is insignificant when compared to other things (e.g. a typical 3-4x reduction in newgen frequency). It is already significantly cheaper than the extra write barrier work done in G1 for remembered set tracking for STW compaction, for example.

> > An extreme way to think of a STW collector is as "the mutators are actually doing all the GC work". It's just that the mutators have stopped doing any mutator-useful work until they are done with all the GC work when a collection is triggered. A concurrent collector shifts the bulk of the work to background GC threads that may be able to do that work concurrently with the mutator threads doing something useful. Loss of efficiency is in no way inherent to that shift of work (at least not any more than STW parallelism with work-stealing queues is less efficient than single-threaded operation).

> > A self-healing barrier (like the one used in C4) is one example of a design mechanism that maintains this efficiency. Self-healing barriers eliminate the inefficiency sometimes found in concurrent collectors by ensuring that barriers only [effectively] trigger at most once per work unit. This makes the GC work invariant to heap topology, object graph mutations, etc., and to who [mutator or GC thread] is actually doing the work. The work per allocation unit then remains fairly fixed. The wish to shift the vast-vast-vast majority of that work to collector threads is driven purely by mutator latency concerns, but does not affect efficiency in any fundamental or significant way.

> > Given the existence of concurrent collectors that are already able to sustain higher application throughput than STW in many practical cases (e.g. Cassandra), the "nearly free" or "even cheaper than free" argument has practical proof points.

> > A key architectural choice that has not been completely finalized in Go (AFAICT) has to do with the ability to move objects. It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycle. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem.
> > Go still has precise GC capability (it knows where all the heap references are), and technically can still allow the movement of objects in the heap, even if current GC implementations do not make use of that ability. My hope for Go is that it is not too late to forgo the undesirable architectural limitation of requiring objects to stay at fixed addresses, because much of what I think is ahead for GC in Go may rely on breaking that assumption. [To be clear, supporting various forms of temporary "pinning" of objects is fine, and does not represent a strong architectural limitation.]

> > The ability to move objects is key for two main qualities that future Go GCs (and GCs in pretty much any runtime) will find desirable:

> > 1. The ability to defragment the heap, in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that can't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but it is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

> > 2. The ability to move objects is fundamental to many of the techniques that support high-throughput GC and fast allocation logic. Generational GC is a proven, nearly-universally-used technique that tends to improve GC throughput by at least one order of magnitude [compared to non-generational uses of similar algorithms] across a wide spectrum of application behaviors in many different runtimes. It achieves that by fundamentally focusing some efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical subset of the heap. And it maintains the qualities needed (tracking the logical subset of the heap, and keeping it sparse) by using object relocation (e.g. for promotion). While there *may* exist alternative techniques that achieve similar throughput improvements without moving objects, I don't know of any that have been widely and successfully used so far, in any runtime. (A card-marking sketch of the tracking cost this implies follows below.)

> > I would predict that as Go's application use cases expand, its GC implementations will evolve to include generational moving collectors. This *does not* necessarily mean a monolithic STW copying newgen. Given their [very correct] focus on latency, I'd expect even the newgen collectors in future generational GC collectors to prioritize the minimization or elimination of STW pauses, which will probably make them either fully concurrent, incremental-STW (with viable arbitrarily small incremental steps), or a combination of the two.

> > And yes, doing that (concurrent generational GC) while supporting great latency, high throughput, and high efficiency all at the same time is very possible. It is NOT the hard or inherent tradeoff many people seem to believe it to be. And practical proof points already exist.

> If it's not "hard", why do we not see more GCs that have great throughput and low latency?

> > Focus? Working on the right problems?
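For a concrete picture of the generational "filter" and its per-write tracking cost, here is a hedged Go sketch of a card-marking write barrier, of the general kind HotSpot's throughput collectors use. The card size, table bounds, and names are invented for the example; a real VM emits this inline as roughly a shift and a byte store.

    package main

    import "fmt"

    // A toy card table: the heap is divided into 512-byte "cards", and
    // every reference store dirties the card covering the updated slot.
    // A newgen collection then scans only the dirty cards for old->new
    // pointers, instead of scanning the whole old generation - the
    // "known-to-be-sparse subset" that makes generational GC cheap.
    const (
        cardShift = 9 // 512-byte cards
        slotSize  = 8 // bytes per reference slot
    )

    type toyHeap struct {
        mem   [1 << 16]uintptr // 64K reference slots (512 KiB)
        cards [1 << 10]byte    // one card byte per 512 heap bytes
    }

    // writeRef is the barriered store: the store itself, plus a shift,
    // an index, and a byte store to mark the card. This small, fixed
    // cost per reference write is what buys the cheap newgen scan.
    func (h *toyHeap) writeRef(slot int, val uintptr) {
        h.mem[slot] = val
        h.cards[slot*slotSize>>cardShift] = 1
    }

    func main() {
        h := &toyHeap{}
        h.writeRef(100, 0xCAFE) // slot 100 -> byte offset 800 -> card 1
        for i, c := range h.cards {
            if c != 0 {
                fmt.Println("dirty card:", i) // only this card needs scanning
            }
        }
    }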
> > It all starts with saying that keeping latency down is a priority, and that STW pauses (larger than other OS artifacts, at least) are unacceptable. The rest is good engineering and implementation.

> But why focus on latency if you're saying it's not "hard" to have low latency and high throughput?

> > I used "hard" in "hard tradeoff" above in the sense of "rigid", "necessary", or "inherent". Not in the sense of "difficult" or "complicated".

> > As to why focus on latency? I'm not saying focusing only on latency is any more right than focusing only on throughput. But ignoring latency in the core design choices is something that is hard to correct later (as evidenced by 20 years of attempts to deal with this in HotSpot collectors that skipped latency-derived core design drivers).

> > Focusing on latency AND throughput is the best way to go IMO. And I'm saying that the two are in no way contradictory or opposite. They can sometimes even be complementary. A high-throughput concurrent collector is a very possible (and very real) thing.

> > The Go team have started from the latency side, and IMO they will make lots of future progress on the throughput side. That's why I predict moving generational collection and compacting oldgen in their future. But it's also why I predict that those collectors will not be doing monolithic stop-the-world moving and compaction. I.e. I think they will end up with either concurrent or finely-incremental-STW relocation [of which I think concurrent is both more likely and more desirable] in all generations.

> Surely if that's the case and you can have your cake and eat it too, it would be done routinely?

> > We've been routinely eating our cake and having it too with C4. And I firmly believe there are at least 17 other ways for others to skin that same cat. A focused set of people undeterred by "surely if this was doable it would already have been done" talk is all it takes. And especially since it actually HAS been done at least once, the question of feasibility is already answered.

> I think Hotspot initially focused on pure throughput - when the Java threads are running, the only overhead is card marks; otherwise, they own the CPU for the time slice they're there, and all (otherwise) eligible CPUs are available for the app to use.

> > The overall approach with HotSpot can be thought of as "start with a STW generational collector and increment your way to lower average/typical pauses in the old generations". E.g. all practical collectors in HotSpot use a monolithic stop-the-world copying newgen collector (not concurrent or incremental; monolithic, run-to-completion STW, always). And all collectors in HotSpot have a fallback to a full STW oldgen pause. The various non-monolithic-STW "cool modes" (partially concurrent non-compacting oldgen in CMS, incremental STW compaction in G1) are filters in the oldgen that occur between full STW newgen pauses and full STW oldgen pauses. They are "happy" when the filters work well enough to keep the STW newgen pauses in the tens of msec and delay the full-STW pauses for hours or days. And they consider it to be "ok and still working" even when those levels are not met.

> > What HotSpot has never focused on is capping STW pause times, which would require the fundamental way of collecting both the oldgen and the newgen to not be STW (including any fallback).
> > And since that has never been the focus, not a lot of design has been put into achieving concurrency and throughput at the same time. E.g. since the vast amount of CPU in HotSpot GC tends to be spent in the newgen collectors, and those are always monolithic-STW (i.e. they have zero concurrency with the application), considerations of concurrency and throughput don't pop up to the top of anyone's list of concerns.

> > But that's just HotSpot. It's not "Java". Just one [very] popular implementation of a Java runtime.

> > Go doesn't have to start from the same point. Instead, they seem to be starting from "start with a low-latency GC and increment your way to higher throughput". They seem to have taken that [very smart] step early [after having a pretty annoying STW early history, which I'd equate to the "classic VM" times in the early Java days], and I'm hoping they'll stick with that path.

> > In that sense, I think the Go team has the right focus and started from working on the right problems. They have a bunch more engineering and implementation work ahead to cover wider use cases, but their choices are already having an effect.

> > On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich wrote:

> FWIW, I think the Go team is right in favoring lower latency over throughput of their GC, given the expected usage scenarios for Go.

> In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with, or can't become CPU bound due to bottlenecks elsewhere).

> In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

> On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:

> > You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go take on things…)

> > On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:

> > > Thought people on here would enjoy this

> > > https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2