Hi Jeremy,

Inline.

On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com) wrote:



On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <tprinte...@twitter.com> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied 
to the sampling rate.  There is no reason to do that.  I want reasonable 
statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I 
never suggested to resize TLABs in order to vary the sampling rate. (Apologies 
if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs entirely


This is correct: we’ll also have to intercept the outside-TLAB allocs. But, 
IMHO, this is a feature, as it’s helpful to know how many (and which) 
allocations happen outside TLABs. These are generally very infrequent (and slow 
anyway), so sampling all of them, instead of only some of them, does not add 
much overhead. But you could also do sampling for the outside-TLAB allocs, if 
you want: just accumulate their size on a separate per-thread counter and 
sample the allocation that bumps that counter over the limit.
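The count-up scheme just described can be sketched in a few lines. This is an illustrative model only; the class and method names (OutsideTlabSampler, recordOutsideTlabAlloc, takeSample) and the 512 KB interval are assumptions for the sketch, not anything from an actual VM:

```java
// Sketch of size-based sampling for outside-TLAB allocations: accumulate
// each allocation's size on a per-thread counter and take a sample whenever
// the counter crosses the configured interval. All names are illustrative.
final class OutsideTlabSampler {
    static final long SAMPLE_INTERVAL_BYTES = 512 * 1024; // e.g., sample every 512 KB

    private long bytesSinceLastSample = 0;  // per-thread state in a real VM
    int samplesTaken = 0;                   // exposed only for demonstration

    void recordOutsideTlabAlloc(long sizeInBytes) {
        bytesSinceLastSample += sizeInBytes;
        if (bytesSinceLastSample >= SAMPLE_INTERVAL_BYTES) {
            takeSample();
            bytesSinceLastSample -= SAMPLE_INTERVAL_BYTES; // carry the remainder
        }
    }

    private void takeSample() {
        samplesTaken++; // a real VM would capture a stack trace etc. here
    }
}
```

After 1 MB of allocations in 1 KB chunks, such a sampler would have fired twice.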

An additional observation (orthogonal to the main point, but I thought I’d 
mention it anyway): For the outside-TLAB allocs it’d be helpful to also know 
which generation the object ended up in (e.g., young gen or direct-to-old-gen). 
This is very helpful in some situations when you’re trying to work out which 
allocation(s) grew the old gen occupancy between two young GCs.

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically 
generalizes this and removes the 1-1 relationship between object alloc sampling 
and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse 
users, you can of course easily just combine the information (but I strongly 
believe it’s better to report the information separately).



(and, in fact, not work if TLAB support is turned off)? 


Who turns off TLABs? Is -XX:-UseTLAB even tested by Oracle? (This is a genuine 
question.)



 I might be missing something obvious (and see my response below).


 
All this requires is a separate counter that is set to the next sampling 
interval, and decremented when an allocation happens, which goes into a slow 
path when the decrement hits 0.  Doing a subtraction and a pointer bump in 
allocation instead of just a pointer bump is basically free.  
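Jeremy's countdown variant differs from the count-up counter in that the fast path only does a subtraction and a sign check, falling into a slow path when the counter reaches zero. A minimal sketch, with illustrative names only (CountdownSampler, onAllocation, and the fixed 256 KB re-arm interval are assumptions):

```java
// Sketch of the countdown scheme: the fast path is one subtraction next to
// the TLAB pointer bump; the slow path runs only when the counter hits zero.
final class CountdownSampler {
    private long bytesUntilNextSample;
    int slowPathHits = 0;                     // exposed only for demonstration

    CountdownSampler(long intervalBytes) {
        bytesUntilNextSample = intervalBytes;
    }

    void onAllocation(long sizeInBytes) {
        bytesUntilNextSample -= sizeInBytes;  // the "basically free" subtraction
        if (bytesUntilNextSample <= 0) {
            sampleSlowPath();
        }
    }

    private void sampleSlowPath() {
        slowPathHits++;                       // capture a stack trace, etc., in a real VM
        bytesUntilNextSample += nextInterval(); // re-arm the counter
    }

    private long nextInterval() {
        return 256 * 1024;                    // fixed here; could be randomized
    }
}
```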


Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other 
folks care about.

Really?  A memory read and a subtraction?  Which architectures care about that?


I was not concerned with the read and subtraction; I was more concerned with 
the conditional that follows them (Intel has great branch prediction).

And a personal pet peeve (based on past experience): How many “free” 
instructions do you have to add before they are not free any more?




Again, notice that no one has complained about the addition that was added for 
total bytes allocated per thread.  I note that was actually added in the 6u20 
timeframe.

Note that it has been doing an additional addition (to keep track of per thread 
allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size 
allocated per thread or number of allocated objects per thread? If it’s the 
former, why isn’t it possible to calculate that from the TLABs information?


Total size allocated per thread.  It isn't possible to calculate that from the 
TLAB because of out-of-TLAB allocation 


The allocating Thread is passed to the slow (outside-TLAB) alloc path so it 
would be trivial to update the per-thread allocation stats from there too (in 
fact, it does; see below).



(and hypothetically disabled TLABs).


Anyone cares? :-)




For some reason, they never included it in the ThreadMXBean interface, but it 
is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a 
com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.
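For reference, the cast Jeremy describes looks like this on HotSpot-based JVMs (getThreadAllocatedBytes() and isThreadAllocatedMemorySupported() are real methods on the com.sun.management extension interface; the class name AllocatedBytesDemo is just for the example):

```java
import java.lang.management.ManagementFactory;

// Reading the per-thread allocated-bytes counter via the com.sun.management
// extension of ThreadMXBean (not available through the standard
// java.lang.management.ThreadMXBean interface).
public class AllocatedBytesDemo {
    public static void main(String[] args) {
        com.sun.management.ThreadMXBean bean =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        if (bean.isThreadAllocatedMemorySupported()) {
            long id = Thread.currentThread().getId();
            long before = bean.getThreadAllocatedBytes(id);
            byte[] garbage = new byte[1024 * 1024];   // allocate ~1 MB on this thread
            long after = bean.getThreadAllocatedBytes(id);
            System.out.println("allocated at least " + garbage.length
                    + " bytes: " + (after - before >= garbage.length));
        }
    }
}
```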


Thanks for the tip. I’ll look into this...

and no one has complained.

I'm not worried about the ease of implementation here, because we've already 
implemented it.  


Yeah, but someone will have to maintain it moving forward.


I've been maintaining it internally to Google for 5 years.  It's actually 
pretty self-contained.  The only work involved is when they refactor something 
(so I've had to move it), or when a bug in the existing implementation is 
discovered.  It is very closely parallel to the TLAB code, which doesn't change 
much / at all.


The TLAB code has really not changed much for a while. ;-) (Though I haven’t 
looked at the JDK 9 source very closely…)

It hasn't even been hard for us to do the forward port, except when the 
relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 
2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned 
off (i.e., you have the tax without any benefit).


Hey, you have a counter in your allocation path you've never noticed, which 
none of your code uses.  Pipelining is a wonderful thing.  :)


See above re: “free” instructions.




In fact, we could just have the sampling code do nothing, and no one would ever 
notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made 
it more expensive to grab a stack trace (the cost became proportional to the 
number of loaded classes), but we have a patch that mitigates that, which we 
would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism 
is that there is quite a lot you can't do from user code during an allocation, 
because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them 
using JNI from the slow path you call directly from the allocation code?

(For context: this referred to the hypothetical feature where we can provide a 
callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but 
let's call it hypothetical for the moment.)


OK.




We invoke native code.  You can't invoke any Java code during allocation, 
including calling JNI methods, because that would make allocation potentially 
reentrant, which doesn't work for all sorts of reasons.


That’s what I was worried about….



  The native code doesn't even get passed a JNIEnv * - there is nothing it can 
do with it without making the VM crash a lot.


So, thanks for the clarification. Attaching a callback that runs inside the VM 
would be totally fine. I was worried that you wanted to call Java. :-)




Or, rather, you might be able to do that, but it would take a lot of Hotspot 
rearchitecting.  When I tried to do it, I realized it would be an extremely 
deep dive.


I believe you. :-)




  However, you can do pretty much anything from the VM itself.  Crucially (for 
us), we don't just log the stack traces, we also keep track of which are live 
and which aren't.  We can't do this in a callback, if the callback can't create 
weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to 
(the callback gets invoked with a StackTraceData object, as I've defined 
above), and another that just tells you which sampled objects are still live.  
We could also add a third, which allowed a callback to set the sampling 
interval (basically, the VM would call it to get the integer number of bytes to 
be allocated before the next sample).  
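One possible shape for the three entry points described above, sketched as a Java interface. Everything here, including the StackTraceData fields, is an assumption reconstructed from the prose, not the actual Google API:

```java
// Hypothetical sketch of the three-method sampler API described above.
// All names and signatures are assumptions drawn from the description.
interface AllocationSampler {
    /** A sample: the stack at allocation time plus basic object info. */
    final class StackTraceData {
        final StackTraceElement[] stack;
        final long sizeInBytes;
        StackTraceData(StackTraceElement[] stack, long sizeInBytes) {
            this.stack = stack;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** 1. Register a callback invoked (from the VM) for each sampled allocation. */
    void setSampleCallback(java.util.function.Consumer<StackTraceData> callback);

    /** 2. Ask which previously sampled objects are still live. */
    java.util.List<StackTraceData> getLiveSamples();

    /** 3. Let a callback choose the interval: bytes to allocate before the next sample. */
    void setIntervalProvider(java.util.function.LongSupplier nextIntervalBytes);
}
```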

Would people be amenable to that?  It makes the code more complex, but, as I 
say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object 
come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and 
you could have identified it by instrumenting the “outside-of-TLAB allocation 
path” (just saying…).


That's orthogonal to the point I was making in the quote above - the point I 
was making there was that we want to be able to detect what sampled objects are 
live.  We can do that regardless of how we implement the sampling (although it 
did involve my making a new kind of weak oop processing mechanism inside the 
VM).


Yeah, I was thinking of doing something similar (tracking object lifetimes, and 
other attributes, with WeakRefs). 




But to the question of whether we can just instrument the outside-of-tlab 
allocation path...  There are a few weirdnesses here.  The first one that jumps 
to mind is that there's also a fast path for allocating in the YG outside of 
TLABs, if an object is too large to fit in the current TLAB.  Those objects 
would never get sampled.  So "outside of tlab" doesn't always mean "slow path".


CollectedHeap::common_mem_allocate_noinit() is the first level of the slow path, 
called when a TLAB allocation fails because the object doesn’t fit in the 
current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) 
whether to refill the current TLAB, or to keep the TLAB and delegate to the GC 
(mem_allocate()) to allocate the object outside a TLAB (either in the young or 
old gen; the GC might also decide to do a collection at this point if, say, the 
eden is full...). So, it depends on what you mean by slow path, but, yes, any 
allocations that go through the above path should be considered “slow path” 
allocations.

One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the 
JFR entry point for outside-TLAB allocs) is fired from 
common_mem_allocate_noinit(). So, if there are other non-TLAB allocation paths 
outside that method, that entry point has been placed incorrectly (it’s 
possible of course, but I think it’s actually placed correctly).

(note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources 
yet, the above might have been changed)

BTW, when looking at the common_mem_allocate_noinit() code I noticed the 
following:

THREAD->incr_allocated_bytes(size * HeapWordSize);
(as predicted earlier)




Another one that jumps to mind is that we don't know whether the 
outside-of-TLAB path actually passes the sampling threshold, especially if we 
let users configure the sampling threshold.  So how would we know whether to 
sample it?


See above (IMHO: sample all of them).




You also have to keep track of the sampling interval in the code where we 
allocate new TLABs, in case the sampling threshold is larger than the TLAB 
size.  That's not a big deal, of course.


Of course, but that’s kinda trivial. BTW, one approach here would be “given 
that refilling a TLAB is slow anyway, always sample the first object in each 
TLAB irrespective of the desired sampling frequency”. Another would be “don’t 
do that; I set the sampling frequency pretty low so as not to be flooded with 
data when the TLABs are very small”. I have to say I’m in the latter camp.





And, every time the TLAB code changes, we have to consider whether / how those 
changes affect this sampling mechanism.


Yes, but how often does the TLAB code change? :-)




I guess my larger point is that there are so many little corner cases with TLAB 
allocation, including whether it even happens, that basing the sampling 
strategy around it seems like a cop-out.  


There are not many little corner cases. There are two cases: allocation inside 
a TLAB, allocation outside a TLAB. The former is by far the most common. The 
latter is generally very infrequent and has a well-defined code path (I 
described it earlier). And, as I said, it could be very helpful and informative 
to treat (and account for) the two cases separately.



And my belief is that the arguments against our strategy don't really hold 
water, especially given the presence of the per-thread allocation counter that 
no one noticed.  


I’ve already addressed that.




Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck 
Rasbold).  All we really need is to write an acceptable JEP, to adjust the code 
based on the changes the community wants, and someone from Oracle willing to 
say "yes".



For reference, to keep track of sampling, the delta to C2 is about 150 LOC 
(much of which is newlines-because-of-formatting for methods that take a lot of 
parameters), the delta to C1 is about 60 LOC, the delta to each x86 template 
interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC. 
It's not completely trivial, but the code hasn't changed substantially in the 
5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across 
platforms - we can do that for PPC and aarch64, on which we do active 
development, at least.


I’ll again vote for the simplicity of having a simple change in only one place 
(OK, two places…).




But, seriously, why didn’t you like my proposal? It can do anything your scheme 
can with fewer and simpler code changes. The only thing that it cannot do is to 
sample based on object count (i.e., every 100 objects) instead of based on 
object size (i.e., every 1MB of allocations). But I think doing sampling based 
on size is the right approach here (IMHO).

I agree that sampling based on size is the right approach.  

(And your approach is definitely simpler - I don't mean to discount it.  And if 
that's what it takes to get this feature accepted, we'll do it, but I'll 
grumble about it.)


That’s fine. :-)

Tony




-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprinte...@twitter.com
