On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> wrote:

>
>
> On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich
> wrote:
>>
>>
>>
>> On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
>>
>>> Go's GC story is evolving. And it's doing so at a fairly quick pace.
>>> Judging it based on a current implementation and current choices is
>>> premature, I think.
>>>
>> Doesn't seem premature at all to me.  Current implementation is what runs
>> in production, what people will experience, and what they should
>> expect/write their code towards.  This is not too dissimilar from the
>> Sufficiently Smart Compiler - it *could* do something in *theory*, but
>> practical/real implementations don't.  That's not to say their story won't
>> improve over time, but judging current impl is pretty fair IMO.
>>
>
> Looking at the current state of implementation to judge current use is
> fine. But projecting from the current state to Go as a whole and assuming
> inherent qualities (something that I see a lot of) would be similar to
> concluding that "Java is slow because it is an interpreted language" and
> basing those conclusions on 1997 JVMs.
>
 Sure.

>
>
>> I think that the Go team's *implementation priority* focus on latency is
>>> a good one. And unlike others, I don't actually think there is a strong
>>> latency vs. throughput tradeoff in the long run. Achieving great latency in
>>> GC does not inherently come at the expense of GC throughput or application
>>> throughput. Both can one had at the same time. My view is that GC
>>> implementations in general (for all runtimes) will move towards good
>>> latency + good throughput at the same time at they mature. And Go is no
>>> different here.
>>>
>> I don't see how minimizing latency cannot impact application throughput.
>> The impact can be minimized, but I don't see how it can be (nearly) zero.
>> The impact will vary from app to app, depending on how much it
>> intrinsically can make use of CPU before bottlenecking elsewhere and giving
>> up CPU for reasons other than GC.
>>
>
> The are tiny tradeoffs involved, but they are "in the noise", and are
> usually immeasurable in practice. So yes, I'm saying that it is "nearly
> zero". Concurrency and latency are not inherently coupled with efficiency
> choices in GC work. In practice, they tend to have very little to do with
> each other.
>
They're necessarily coupled IMO.  If a CPU is occupied with GC work that
could otherwise be used by the user app, it's a throughput hit.  If extra
instructions are executed on behalf of GC while also performing user work,
it's a loss in throughput (e.g. write barriers, indirection barriers that
may hijack execution, etc).  I'm not even talking about the possible
d-cache and i-cache pollution, cache misses, registers reserved to support
GC, etc that may be involved.  Since GC cannot know the allocation/memory
pattern perfectly, it'll ultimately perform more CPU work than could
otherwise be necessary.  This is the price of admission for this type of
automatic management.  There's nothing inherently wrong with it if that
cost is acceptable.  Likewise, there's a development/mental cost to manual
memory management as well.  It's all about priorities and cost.

>
> The fundamentals are simple:
>
> For a given live set, heap set, and rate of allocation, a GC algorithm has
> a certain amount of work to do. It needs to locate the dead material and
> reclaim it such that it can be allocated into again. The choice of
> algorithms used affect the amount of work needed to achieve this. But this
> amount of work tends to be only minutely affected by concurrency concerns,
> and is dramatically affected by other choices. E.g. the choice to use
> compacting collectors dramatically affects the cost of allocation (bump
> pointer allocation is dramatically cheaper than allocation in
> non-contiguously empty regions). E.g. the choice of using a generational
> filter (or not) has several orders of magnitude higher impact on GC
> throughput than the choice to do concurrent or tiny-incremental-STW work
> (or not). The choice to use evacuation as a compaction technique (common in
> copying and regional collectors, but not in classic MarkSweepCompact) can
> dramatically affect efficiency (in two opposite directions) and depends a
> lot more on the amount of empty heap (aka "heap headroom") available than
> on concurrency or incremental STW design choices.
>
Concurrent collectors, at least in Hotspot, have a tendency to be outpaced
by the mutators because they try to minimize throughput impact (i.e. they
don't run all the time, but get triggered when certain thresholds are
breached at certain check points).  Then you have a problem of object graph
mutation rate (not allocation rate), which in the case of G1 will also
increase SATB queue processing and finalize marking phases.  Yes, one can
tune (or code) around it, and throw lots more headroom when all else fails.

So an alternative is you make mutator threads assist in relocations/fixups,
which I believe is what C4 does (in a nutshell, I'm glossing over a bunch
of things).  But, this adds read barriers.  I know we talked about their
purported cost before.  Certainly one can try to minimize the impact, but
I'm having a hard time believing that it approaches the same level of
impact as a full STW collector.  Of course the impact will vary based on
app profile (e.g. reference/pointer heavy vs flatter structures, object
graph mutation rate, etc).


> Given the existence of concurrent collectors that are already able to
> sustain higher application throughput that STW in many practical cases
> (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument
> has practical proof points.
>
>
>>
>>> A key architectural choice that has not been completely finalized in Go
>>> (AFAICT) has to do with the ability to move objects. It is common to see
>>> early GC-based runtime implementations that do not move objects around
>>> succumb to an architectural "hole" that creates a reliance on that fact
>>> early in the maturity cycles. This usually happens by allowing native code
>>> extensions to assume object addresses are fixed (i.e. that objects will not
>>> move during their lifetime) and getting stuck with that "assumed quality"
>>> in the long run due to practical compatibility needs. From what I can tell,
>>> it is not too late for Go to avoid this problem. Go still has precise GC
>>> capability (it knows where all the heap references are), and technically
>>> can still allow the movement of objects in the heap, even if current GC
>>> implementations do not make use of that ability. My hope for Go is that it
>>> is not too late to forgo the undesirable architectural limitation of
>>> requiring objects to stay at fixed addresses, because much of what I think
>>> is ahead for GC in Go may rely on breaking that assumption. [to be clear,
>>> supporting various forms of temporary "pinning" of objects is fine, and
>>> does not represent a strong architectural limitation.]
>>>
>>> The ability to move objects is key for two main qualities that future Go
>>> GC's (and GCs in pretty much any runtime) will find desirable:
>>>
>>> 1. The ability to defragment the heap in order to support indefinite
>>> execution lengths that do not depend on friendly behavior patterns for
>>> object sizes over time. And defragmentation requires moving objects around
>>> before they die. Heaps that can compact are dramatically more resilient
>>> than ones that don't. Much like GC itself, self-defragmenting heaps allow
>>> the relatively easy construction of software that is free to deal with
>>> arbitrarily complex data shapes and scenarios. The value of this freedom is
>>> often disputed by people who focus on "things that have worked in C/C++ for
>>> a long time", but is easily seen in the multitude of application logic that
>>> breaks when restricted to using "malloc'ed heap" or "arena allocator"
>>> memory management mechanisms.
>>>
>>> 2. The ability of move objects is fundamental to many of the techniques
>>> that support high throughput GC and fast allocation logic. Generational GC
>>> is a proven, nearly-universally-used technique that tends to improve GC
>>> throughput by at least one order of magnitude [compared to non-generational
>>> uses of similar algorithms] across a wide spectrum of application behaviors
>>> in many different runtimes. It achieves that by fundamentally focusing some
>>> efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical
>>> subset of the heap. And it maintains the qualities needed (tracking the
>>> logical subset of the heap, and keeping it sparse) by using object
>>> relocation (e.g. for promotion). While there *may* exist alternative
>>> techniques that achieve similar throughput improvements without moving
>>> objects, I don't know of any that have been widely and successfully used so
>>> far, in any runtime.
>>>
>>> I would predict that as Go's use application cases expand it's GC
>>> implementations will evolve to include generational moving collectors. This
>>> *does not* necessarily mean a monolithic STW copying newgen. Given their
>>> [very correct] focus on latency, I'd expect even the newgen collectors in
>>> future generational GC collectors to prioritize the minimization or
>>> elimination of STW pauses, which will probably make them either fully
>>> concurrent, incremental-STW (with viable arbitrarily small incremental
>>> steps), or a combination of the two.
>>>
>>> And yes, doing that (concurrent generational GC) while supporting great
>>> latency, high throughput, and high efficiency all at the same time, is very
>>> possible. It is NOT the hard or inherent tradeoff many people seem to
>>> believe it to be. And practical proof points already exist.
>>>
>> If it's not "hard", why do we not see more GCs that have great throughput
>> and low latency?
>>
>
> Focus? Working on the right problems?
>
> It all starts with saying that keeping latency down is a priority, and
> that STW pauses (larger than other OS artifacts at least, at least) are
> unacceptable. The rest is good engineering and implementation.
>
But why focus on latency if you're saying it's not "hard" to have low
latency and high throughput? Surely if that's the case and you can have
your cake and eat it too, it would be done routinely? I think Hotspot
initially focused on pure throughput - when the java threads are running,
the only overhead is card marks; otherwise, they own the CPU for the time
slice they're there and all (otherwise) eligible CPUs are available for the
app to use.

>
> In that sense, I think the Go team has the right focus and started from
> working on the right problems. They have a bunch more engineering and
> implementation work ahead to cover wider use cases, but their choices are
> already having an effect.
>
>
>
>>
>>> On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich
>>> wrote:
>>>>
>>>> FWIW, I think the Go team is right in favoring lower latency over
>>>> throughput of their GC given the expected usage scenarios for Go.
>>>>
>>>> In fact, most of the (Hotspot based) Java GC horror stories involve
>>>> very long pauses (G1 and CMS not excluded) - I've yet to hear anyone
>>>> complain that their "Big Data" servers are too slow because the GC is
>>>> chipping away at throughput too much (most of those servers aren't CPU
>>>> bound to begin with or can't become CPU bound due to bottlenecks 
>>>> elsewhere).
>>>>
>>>> In the Hotspot world, there's also the unfortunate situation right now
>>>> where G1 isn't really solving any problem nor advancing the performance
>>>> (latency) boundaries.  In fact, it taxes throughput quite heavily (e.g.
>>>> significantly more expensive write barriers) but isn't breaking any new
>>>> ground.
>>>>
>>>>
>>>> On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:
>>>>
>>>>> You might be interested in this Golang-group follow-up thread too…
>>>>> https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4
>>>>> (naturally a more pro-Go
>>>>> take on things…)
>>>>>
>>>>> On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
>>>>>>
>>>>>> Thought people on here would enjoy this
>>>>>>
>>>>>>
>>>>>> https://medium.com/@octskyward/modern-garbage-collection-
>>>>>> 911ef4f8bd8e#.ptdzwcmq2
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>> Studying for the Turing test
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "mechanical-sympathy" group.
>>>>>
>>>>>
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>>>>
>>>>>
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>>
>>>>> --
>>>> Sent from my phone
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mechanical-sympathy+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to