On Thursday, December 22, 2016 at 2:18:30 PM UTC-8, Vitaly Davidovich wrote:
> So I agree that Go focusing on low latency is the way to go.  I'm not sold 
> that they can make it so low cost that building, for example, a high 
> performance database in Go would have comparable latency and throughput to 
> one built in a non-GC'd language/runtime.  All the "little/insignificant" 
> inefficiencies you brush away as being virtually zero/undetectable/etc add 
> up.  Heck, we've talked about the costs of context (or mode) switches on this 
> thread before, and how they can be dominated by the caches needing to be 
> warmed up (vs register/context being saved/restored).  But you deem that type 
> of cost negligible.  I'll agree that it's negligible compared to 100s of 
> millis (or worse - multiple secs or tens of secs) GC pauses.  It's definitely 
> not negligible for the lower level systems being built.
> 
> 
> I'm curious about this bit: "Yes, the C4 LVB barrier is one such design 
> example. It amounts to adding a test+jmp sequence in the reference load path 
> [no memory access, virtually-always-predictable, devolves to a single u-op on 
> x86]. While there is some "cost" to this extra u-op, its throughput impact 
> is insignificant when compared to other things (e.g. a typical 3-4x reduction 
> in newgen frequency). It is already significantly cheaper than the extra 
> write barrier work done in G1 for remembered set tracking for STW compaction, 
> for example."
> So there's a test+jmp inserted before every load, which is pretty standard 
> and common fare in Java code.  But how is it "no memory access"? IIRC, when 
> we last discussed the intricacies, a LVB would check whether a phase change 
> occurred, which I think is a memory read, isn't it? Or by "no memory access" 
> you mean it should still be valid in the cache? Also, I'm assuming once 
> inlining stops at some point, the LVB/phase change information needs to be 
> regathered by callees that were not inlined? So in other words, this will 
> devolve into another prologue (beyond call frame setup)? I can't see how this 
> won't impact i-cache or pollute the BTB history.

The test+jmp happens after a reference load, not before a load (or a store) 
that uses a reference [in this sense it is very different from, and less 
frequently executed than, "use barrier" variants of read barriers, such as 
Brooks-style read barriers].

The test is between the just-loaded reference and a very stable (changes only 
when phase changes occur) value. That value can be read from memory (from a 
thread-local location, for example), but it can be (and is) cached in a 
register. That caching can even be done across frames, such that only leaving 
and re-entering Java (e.g. making JNI calls) needs to re-establish it. The 
choice of keeping the value in a register or "spilling/filling" it from memory 
can be left to the register allocator (it's just like anything else held in 
registers, except that "spilling" can be made free).

There is certainly an I-cache footprint here, and also an increased density of 
branches (when compared to not having the test+jmp at all). It's just that 
those effects are tiny in the greater scheme of things. For some perspective: 
the I-cache and branch density impacts of LVBs in current implementations are 
much smaller than the extra I-cache and branch density impacts of the current 
G1 write barrier implementations (compared to the cheaper non-G1 write 
barriers).

> 
> 
> On Thu, Dec 22, 2016 at 3:58 PM, Gil Tene <g...@azul.com> wrote:
> 
> 
> 
> On Thursday, December 22, 2016 at 10:46:47 AM UTC-8, Vitaly Davidovich wrote:
> 
> 
> 
> 
> On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> wrote:
> 
> 
> 
> On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich wrote:
> 
> 
> 
> 
> On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
> 
> Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging 
> it based on a current implementation and current choices is premature, I 
> think.
> Doesn't seem premature at all to me.  Current implementation is what runs in 
> production, what people will experience, and what they should expect/write 
> their code towards.  This is not too dissimilar from the Sufficiently Smart 
> Compiler - it *could* do something in *theory*, but practical/real 
> implementations don't.  That's not to say their story won't improve over 
> time, but judging current impl is pretty fair IMO. 
> 
> 
> Looking at the current state of implementation to judge current use is fine. 
> But projecting from the current state to Go as a whole and assuming inherent 
> qualities (something that I see a lot of) would be similar to concluding that 
> "Java is slow because it is an interpreted language" and basing those 
> conclusions on 1997 JVMs.
>  Sure. 
> 
>  
> 
> 
> 
> 
> I think that the Go team's *implementation priority* focus on latency is a 
> good one. And unlike others, I don't actually think there is a strong latency 
> vs. throughput tradeoff in the long run. Achieving great latency in GC does 
> not inherently come at the expense of GC throughput or application 
> throughput. Both can be had at the same time. My view is that GC 
> implementations in general (for all runtimes) will move towards good latency 
> + good throughput at the same time as they mature. And Go is no different 
> here.
> I don't see how minimizing latency cannot impact application throughput.  The 
> impact can be minimized, but I don't see how it can be (nearly) zero.  The 
> impact will vary from app to app, depending on how much it intrinsically can 
> make use of CPU before bottlenecking elsewhere and giving up CPU for reasons 
> other than GC. 
> 
> 
> There are tiny tradeoffs involved, but they are "in the noise", and are usually 
> immeasurable in practice. So yes, I'm saying that it is "nearly zero". 
> Concurrency and latency are not inherently coupled with efficiency choices in 
> GC work. In practice, they tend to have very little to do with each other.
> They're necessarily coupled IMO.  If a CPU is occupied with GC work that 
> could otherwise be used by the user app, it's a throughput hit.  
> 
> 
> You may be confusing what you know about the details of some "somewhat 
> concurrent, but not designed for simultaneous concurrency and efficiency" 
> implementations with what is inherent to performing concurrent GC work.
> 
> 
> The bulk (as in the vast, vast majority) of the overall CPU consumed by GC 
> work is an inherent result of allocation rate (which is a unit of 
> throughput). How much of that work you do per allocation unit is NOT strongly 
> affected by concurrency and interleaving choices in GC algorithms. I listed 
> examples of dramatically more dominant algorithmic choices that have nothing 
> to do with concurrency or incremental ability.
>  
> 
> 
> 
> If extra instructions are executed on behalf of GC while also performing user 
> work, it's a loss in throughput (e.g. write barriers, indirection barriers 
> that may hijack execution, etc).
> 
> 
> Again, most of that "extra instruction" cost is driven not by concurrency or 
> incremental ability, but by other large, throughput-affecting choices. E.g. 
> the need for a write barrier is usually dictated by generational collection 
> (STW or not doesn't matter), not by concurrency. And adding a write barrier 
> (to achieve generational collection) is a proven way to dramatically increase 
> GC efficiency.
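
[As an aside, and purely for illustration: the "extra instructions" of a 
generational write barrier amount to something like the card-marking sketch 
below, written here in Go with an assumed card size and layout rather than 
any particular VM's actual barrier:]

    package barrier

    import "unsafe"

    const cardShift = 9 // 512-byte cards; a common, but here assumed, choice

    var (
        heapBase  uintptr
        cardTable []byte // one byte per 512-byte card of the heap
    )

    // writeRef models a reference store plus the generational write barrier:
    // the store itself, then dirtying the card that contains the written slot
    // so the next newgen collection scans only dirty cards for old->new pointers.
    func writeRef(slot *uintptr, newRef uintptr) {
        *slot = newRef
        card := (uintptr(unsafe.Pointer(slot)) - heapBase) >> cardShift
        cardTable[card] = 1 // a couple of extra instructions per reference store
    }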
>  
> 
> 
> 
>   I'm not even talking about the possible d-cache and i-cache pollution, 
> cache misses, registers reserved to support GC, etc that may be involved.
> 
> 
> The impact of these is so tiny that it is hard to measure. Certainly a few 
> orders of magnitude smaller than e.g. generational collection yes/no 
> decisions.
> 
> 
> 
> 
> 
> Since GC cannot know the allocation/memory pattern perfectly, it'll 
> ultimately perform more CPU work than could otherwise be necessary.  This is 
> the price of admission for this type of automatic management.
> 
> 
> When a concurrent GC's triggering and throttling mechanisms successfully get 
> to 98%+ occupancy on a regular basis, this perceived cost fades away. And it 
> is easy and common to see that level of full use of the heap resources [yes, 
> "70% occupancy trigger" is a silly thing, but it is in no way inherent or 
> needed; it's just an artifact of some implementations]. Even where things are 
> more conservative (GC kicks off earlier than it should), the impact of the 
> extra time spent tends to be significantly lower than that of other choices, 
> and is often offset by the easier choices that concurrency allows.
> 
> 
> To use a practical example: a concurrent generational collector can easily 
> dispense with the artificial line between oldgen and newgen heaps that many 
> STW collectors employ, because when neither generation incurs significant 
> pauses, the original reason for having the line there is gone (when oldgen 
> STW was significant, exhausting the oldgen early was unacceptable). This 
> allows the newgen to always use the entire available heap, reducing newgen 
> collection frequency by a significant factor, which eliminates a significant 
> amount of GC work (with the same application throughput in the same heap 
> size). This results in LESS work for GC, and higher throughput. This 
> increased efficiency is more an artifact of GC algorithm choice (share the 
> heap between oldgen and newgen rather than segregate it) than a result of 
> concurrency. Concurrency simply enables that choice to be made much more 
> easily.  
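
[To put rough, made-up numbers on that: newgen collections trigger roughly 
every (free newgen space / allocation rate), so letting the newgen grow into 
all of the currently free heap rather than a fixed slice of it divides the 
collection frequency directly. A trivial Go calculation with assumed figures:]

    package main

    import "fmt"

    func main() {
        allocRate := 2.0   // GB/s allocated by the app (assumed)
        fixedNewgen := 4.0 // GB of newgen with a hard oldgen/newgen boundary (assumed)
        sharedFree := 16.0 // GB of free heap usable by newgen without the boundary (assumed)

        fmt.Printf("fixed boundary: a newgen collection every %.0fs\n",
            fixedNewgen/allocRate)
        fmt.Printf("shared heap:    a newgen collection every %.0fs (%.0fx fewer collections)\n",
            sharedFree/allocRate, sharedFree/fixedNewgen)
    }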
> 
> 
> 
> 
> 
> There's nothing inherently wrong with it if that cost is acceptable.  
> Likewise, there's a development/mental cost to manual memory management as 
> well.  It's all about priorities and cost.
> 
> 
> A virtually-zero additional cost in CPU is acceptable by definition.
> 
> 
> 
> 
> 
> 
> 
> 
> The fundamentals are simple:
> 
> 
> For a given live set, heap set, and rate of allocation, a GC algorithm has a 
> certain amount of work to do. It needs to locate the dead material and 
> reclaim it such that it can be allocated into again. The choice of algorithms 
> used affects the amount of work needed to achieve this. But this amount of 
> work tends to be only minutely affected by concurrency concerns, and is 
> dramatically affected by other choices. E.g. the choice to use compacting 
> collectors dramatically affects the cost of allocation (bump-pointer 
> allocation is dramatically cheaper than allocation in non-contiguously-empty 
> regions). E.g. the choice of using a generational filter (or not) has several 
> orders of magnitude higher impact on GC throughput than the choice to do 
> concurrent or tiny-incremental-STW work (or not). The choice to use 
> evacuation as a compaction technique (common in copying and regional 
> collectors, but not in classic MarkSweepCompact) can dramatically affect 
> efficiency (in either direction) and depends a lot more on the amount 
> of empty heap (aka "heap headroom") available than on concurrency or 
> incremental STW design choices.
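
[For a sense of why compaction matters to allocation cost, a minimal 
bump-pointer allocator sketch in Go (illustrative only, not any runtime's 
allocator): with a contiguously empty region the fast path is one compare and 
one add, versus searching free lists or partially filled regions when the 
heap cannot be compacted:]

    package alloc

    // bumpRegion is a contiguously empty region handed out by a compacting
    // collector; cur and limit would normally live in registers or a TLAB.
    type bumpRegion struct {
        cur, limit uintptr
    }

    // alloc is the whole fast path: one compare and one add.
    func (r *bumpRegion) alloc(size uintptr) (uintptr, bool) {
        if r.limit-r.cur < size {
            return 0, false // region exhausted: fall back to the collector
        }
        p := r.cur
        r.cur += size
        return p, true
    }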
> Concurrent collectors, at least in Hotspot, have a tendency to be outpaced by 
> the mutators because they try to minimize throughput impact (i.e. they don't 
> run all the time, but get triggered when certain thresholds are breached at 
> certain check points).  Then you have a problem of object graph mutation rate 
> (not allocation rate), which in the case of G1 will also increase SATB queue 
> processing and lengthen the finalize-marking phase.  Yes, one can tune (or 
> code) around it, and throw lots more headroom at it when all else fails.
> 
> 
> You note examples from partially concurrent collectors that never started 
> with a "stay away from STW pauses" approach, and were not designed to 
> maintain collector efficiencies independent of mutation and/or allocation 
> rates. Future Go collectors do not have to mimic those design choices, and 
> neither do other already-successful concurrent collectors that produce both 
> consistent low latency and high throughput at the same time.
> 
> 
> Being outpaced by mutators is a problem only if the algorithms you choose are 
> susceptible to such outpacing. Same for mutation rates. Making it impossible 
> to be outpaced by the mutator by guaranteeing single pass completion and 
> making the collector work invariant to mutation (e.g. by not needing to track 
> or react to mutations to complete a GC phase) eliminates those problems. They 
> are design choices driven by simple core statements, like "don't make 
> collector work per allocation unit vary with allocation rate" and "don't make 
> collector work per allocation unit vary with mutation rate". Both are doable 
> without impacting efficiency.
>  
> 
> 
> So an alternative is you make mutator threads assist in relocations/fixups, 
> which I believe is what C4 does (in a nutshell, I'm glossing over a bunch of 
> things).  But, this adds read barriers.  I know we talked about their 
> purported cost before.  Certainly one can try to minimize the impact, but I'm 
> having a hard time believing that it approaches the same level of impact as a 
> full STW collector.  Of course the impact will vary based on app profile 
> (e.g. reference/pointer heavy vs flatter structures, object graph mutation 
> rate, etc).
> 
> 
> There are many possible designs that can meet those needs without some 
> inherent sacrifice in efficiency. Especially when that efficiency is often 
> regained through some other side-effect.
> 
> 
> Yes, the C4 LVB barrier is one such design example. It amounts to adding a 
> test+jmp sequence in the reference load path [no memory access, 
> virtually-always-predictable, devolves to a single u-op on x86]. While there 
> is some "cost" to this extra u-op, its throughput impact is insignificant 
> when compared to other things (e.g. a typical 3-4x reduction in newgen 
> frequency). It is already significantly cheaper than the extra write barrier 
> work done in G1 for remembered set tracking for STW compaction, for example.
> 
> 
> An extreme way to think of a STW collector is as "the mutators are actually 
> doing all the GC work". It's just that the mutators have stopped doing any 
> mutator-useful work until they are done with all the GC work when a 
> collection is triggered. A concurrent collector shifts the bulk of the work 
> to background GC threads that may be able to do that work concurrently with 
> the mutator threads doing something useful. Loss of efficiency is in no way 
> inherent to that shift of work (at least not any more than STW parallelism 
> with work-stealing queues is less efficient than single-threaded operation).
>   
> A self-healing barrier (like the one used in C4) is one example of a design 
> mechanism that maintains this efficiency. Self-healing barriers eliminate the 
> inefficiency sometimes found in concurrent collectors by ensuring that 
> barriers only [effectively] trigger at most once per work unit. This makes 
> the GC work invariant to heap topology, object graph mutations, etc., and to 
> who [mutator or GC thread] is actually doing the work. The work per 
> allocation unit then remains fairly fixed. The wish to shift the 
> vast majority of that work to collector threads is driven purely by 
> mutator latency concerns, but does not affect efficiency in any fundamental 
> or significant way. 
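
[Sketching the self-healing part itself, again in Go and again illustrative 
only (the encoding and the fix-up helper are stand-ins, not the C4 source): 
when the barrier does trigger, the corrected reference is written back to the 
location it was loaded from, so that location stops triggering for the rest 
of the cycle no matter how many times it is loaded afterwards:]

    package selfheal

    // loadRefSelfHealing models the self-healing variant of the load barrier:
    // the slow path not only returns a corrected reference, it also repairs
    // the memory slot it came from, bounding barrier work to (roughly) once
    // per reference slot per GC cycle rather than once per load.
    func loadRefSelfHealing(slot *uintptr, phaseMask, expected uintptr) uintptr {
        r := *slot
        if r&phaseMask != expected { // rare: reference not yet "good" for this cycle
            r = fixUp(r)             // hypothetical: remap and/or mark-through
            *slot = r                // heal the slot (a real implementation would CAS)
        }
        return r
    }

    // fixUp is hypothetical; it stands for the phase-specific correction.
    func fixUp(r uintptr) uintptr {
        // ... relocation / mark-through work elided ...
        return r
    }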
>  
> 
> 
> 
> Given the existence of concurrent collectors that are already able to sustain 
> higher application throughput than STW collectors in many practical cases 
> (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument 
> has practical proof points. 
>  
> 
> 
> 
> 
> 
> 
> A key architectural choice that has not been completely finalized in Go 
> (AFAICT) has to do with the ability to move objects. It is common to see 
> early GC-based runtime implementations that do not move objects around 
> succumb to an architectural "hole" that creates a reliance on that fact early 
> in the maturity cycles. This usually happens by allowing native code 
> extensions to assume object addresses are fixed (i.e. that objects will not 
> move during their lifetime) and getting stuck with that "assumed quality" in 
> the long run due to practical compatibility needs. From what I can tell, it 
> is not too late for Go to avoid this problem. Go still has precise GC 
> capability (it knows where all the heap references are), and technically can 
> still allow the movement of objects in the heap, even if current GC 
> implementations do not make use of that ability. My hope for Go is that it is 
> not too late to forgo the undesirable architectural limitation of requiring 
> objects to stay at fixed addresses, because much of what I think is ahead for 
> GC in Go may rely on breaking that assumption. [to be clear, supporting 
> various forms of temporary "pinning" of objects is fine, and does not 
> represent a strong architectural limitation.]
> 
> 
> The ability to move objects is key for two main qualities that future Go GCs 
> (and GCs in pretty much any runtime) will find desirable:
> 
> 
> 1. The ability to defragment the heap in order to support indefinite 
> execution lengths that do not depend on friendly behavior patterns for object 
> sizes over time. And defragmentation requires moving objects around before 
> they die. Heaps that can compact are dramatically more resilient than ones 
> that can't. Much like GC itself, self-defragmenting heaps allow the 
> relatively easy construction of software that is free to deal with 
> arbitrarily complex data shapes and scenarios. The value of this freedom is 
> often disputed by people who focus on "things that have worked in C/C++ for a 
> long time", but it is easily seen in the multitude of application logic that 
> breaks when restricted to using "malloc'ed heap" or "arena allocator" memory 
> management mechanisms.
> 
> 
> 2. The ability to move objects is fundamental to many of the techniques that 
> support high-throughput GC and fast allocation logic. Generational GC is a 
> proven, nearly-universally-used technique that tends to improve GC throughput 
> by at least one order of magnitude [compared to non-generational uses of 
> similar algorithms] across a wide spectrum of application behaviors in many 
> different runtimes. It achieves that by fundamentally focusing some 
> efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical 
> subset of the heap. And it maintains the qualities needed (tracking the 
> logical subset of the heap, and keeping it sparse) by using object relocation 
> (e.g. for promotion). While there *may* exist alternative techniques that 
> achieve similar throughput improvements without moving objects, I don't know 
> of any that have been widely and successfully used so far, in any runtime.
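
[A toy picture of that generational filter, in Go and illustrative only (real 
collectors work on memory regions, not maps, and use remembered sets or dirty 
cards to find old-to-young pointers, which are assumed to be folded into the 
roots here): only the sparse young set is traced, survivors are "relocated" 
(promoted) into the old set, and whatever remains in the young set is 
reclaimed wholesale:]

    package gengc

    type object struct {
        marked bool
        refs   []*object
    }

    // collectYoung traces only the young subset of the heap. Promotion into
    // `old` stands in for physically relocating a surviving object.
    func collectYoung(roots []*object, young, old map[*object]bool) {
        var visit func(o *object)
        visit = func(o *object) {
            if o == nil || o.marked || old[o] {
                return // old objects are not traced during a young collection
            }
            o.marked = true
            delete(young, o) // "relocate" the survivor out of the young set
            old[o] = true    // promotion into the old generation
            for _, r := range o.refs {
                visit(r)
            }
        }
        for _, r := range roots {
            visit(r)
        }
        for o := range young { // everything left in young is dead
            delete(young, o)   // the whole young region is reclaimed in bulk
        }
    }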
> 
> 
> I would predict that as Go's application use cases expand, its GC 
> implementations will evolve to include generational moving collectors. This 
> *does not* necessarily mean a monolithic STW copying newgen. Given their 
> [very correct] focus on latency, I'd expect even the newgen collectors in 
> future generational GC collectors to prioritize the minimization or 
> elimination of STW pauses, which will probably make them either fully 
> concurrent, incremental-STW (with viable, arbitrarily small incremental 
> steps), or a combination of the two.
> 
> 
> And yes, doing that (concurrent generational GC) while supporting great 
> latency, high throughput, and high efficiency all at the same time, is very 
> possible. It is NOT the hard or inherent tradeoff many people seem to believe 
> it to be. And practical proof points already exist.
> If it's not "hard", why do we not see more GCs that have great throughput and 
> low latency? 
> 
> 
> Focus? Working on the right problems?
> 
> 
> It all starts with saying that keeping latency down is a priority, and that 
> STW pauses (larger than other OS artifacts, at least) are unacceptable. The 
> rest is good engineering and implementation.
> But why focus on latency if you're saying it's not "hard" to have low latency 
> and high throughput? 
> 
> 
> I used "hard" in "hard tradeoff" above in the sense off "rigid", "necessary", 
> or "inherent". Not in the sense "difficult" or "complicated". 
> 
> 
> As to why focus on latency? I'm not saying focusing only on latency is any 
> more right than focusing only on throughput. But ignoring latency in the core 
> design choices is something that is hard to correct later (as evidenced by 
> 20 years of attempts to deal with this in HotSpot collectors, which skipped 
> latency-derived core design drivers).
> 
> 
> Focusing on latency AND throughput is the best way to go, IMO. And I'm saying 
> that the two are in no way contradictory or opposite. They can sometimes even 
> be complementary. A high-throughput concurrent collector is a very possible 
> (and very real) thing.
> 
> 
> 
> The Go team have started from the latency side, and IMO they will make lots 
> of future progress on the throughput side. That's why I predict moving 
> generational collection and compacting oldgen in their future. But it's also 
> why I predict that those collectors will not be doing monolithic 
> stop-the-world moving and compaction. i.e. I think they will end up with 
> either concurrent or finely-incremental-STW relocation [of which I think 
> concurrent is both more likely and more desirable] in all generations.
>  
> 
> 
> Surely if that's the case and you can have your cake and eat it too, it would 
> be done routinely?
> 
> 
> We've been routinely eating our cake and having it with C4. And I firmly 
> believe there are at least 17 other ways for others to skin that same cat. A 
> focused set of people undeterred by "Surely if this was doable it would 
> already have been done" talk is all it takes. And especially since it 
> actually HAS been done at least once, the question of feasibility is already 
> answered.
> 
>  
> 
> 
> I think Hotspot initially focused on pure throughput - when the Java threads 
> are running, the only overhead is card marks; otherwise, they own the CPU for 
> the time slice they're there and all (otherwise) eligible CPUs are available 
> for the app to use.
> 
> 
> The overall approach with HotSpot can be thought of as "start with a STW 
> generational collector and increment your way to lower average/typical pauses 
> in the old generations". E.g. all practical collectors in HotSpot use a 
> monolithic stop-the-world copying newgen collector (not concurrent or 
> incremental; monolithic, run-to-completion STW, always). And all collectors 
> in HotSpot have a fallback to a full STW oldgen pause. The various 
> non-monolithic-STW "cool modes" (partially concurrent non-compacting oldgen 
> in CMS, incremental STW compaction in G1) are filters in the oldgen that 
> occur between full STW newgen pauses and full STW oldgen pauses. They are 
> "happy" when the filters work well enough to keep the STW newgen pauses in 
> the tens of msec and delay the full-STW pauses for hours or days. And they 
> consider it to be "ok and still working" even when those levels are not met.
> 
> 
> 
> What HotSpot has never focused on is capping STW pause times, which would 
> require the fundamental way of collecting both the oldgen and the newgen to 
> not be STW (including any fallback). And since that has never been the focus, 
> not a lot of design has been put into achieving concurrency and throughput at 
> the same time. E.g. since the vast amount of CPU in HotSpot GC tends to be 
> spent in the newgen collectors, and those are always monolithic-STW (i.e. 
> they have zero concurrency with the application), considerations of 
> concurrency and throughput don't pop up to the top of anyone's list of 
> concerns.
> 
> 
> 
> But that's just HotSpot. It's not "Java". Just one [very] popular 
> implementation of a Java runtime.
> 
> 
> Go doesn't have to start from the same point. Instead they seem to be 
> starting from a "start with a low-latency GC and increment your way to higher 
> throughput". They seem to have taken that [very smart] step early [after 
> having a pretty annoying STW early history, which I'd equate to the "classic 
> VM" times in the early Java days], and I'm hoping they'll stick with that 
> path.
> 
> 
>  
> 
> 
> 
> 
> 
> 
> In that sense, I think the Go team has the right focus and started from 
> working on the right problems. They have a bunch more engineering and 
> implementation work ahead to cover wider use cases, but their choices are 
> already having an effect.
> 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich wrote:
> FWIW, I think the Go team is right in favoring lower latency over throughput 
> of their GC given the expected usage scenarios for Go.
> 
> 
> In fact, most of the (Hotspot based) Java GC horror stories involve very long 
> pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that 
> their "Big Data" servers are too slow because the GC is chipping away at 
> throughput too much (most of those servers aren't CPU bound to begin with or 
> can't become CPU bound due to bottlenecks elsewhere).
> 
> 
> In the Hotspot world, there's also the unfortunate situation right now where 
> G1 isn't really solving any problem nor advancing the performance (latency) 
> boundaries.  In fact, it taxes throughput quite heavily (e.g. significantly 
> more expensive write barriers) but isn't breaking any new ground.
> 
> 
> 
> 
> 
> On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:
> 
> You might be interested in this Golang-group follow-up thread too… 
> https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a 
> more pro-Go
> take on things…)
> 
> 
> On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote: 
> Thought people on here would enjoy this
> 
> 
> https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2
> 
> 
> 
> 
> 
> -- 
> 
> 
> Studying for the Turing test
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> 
> Sent from my phone
> 
> 
> 
> 
> 
> 
> 
