Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied 
to the sampling rate.  There is no reason to do that.  I want reasonable 
statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this, which is why I 
never suggested resizing TLABs in order to vary the sampling rate. (Apologies 
if my e-mail was not clear.)




All this requires is a separate counter that is set to the next sampling 
interval, and decremented when an allocation happens, which goes into a slow 
path when the decrement hits 0.  Doing a subtraction and a pointer bump in 
allocation instead of just a pointer bump is basically free.  


Maybe it’s cheap on Intel, but it may not be on other platforms that other 
folks care about.
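
For reference, the fast path in question would look roughly like this 
(hypothetical names and layout, not actual HotSpot code); the subtraction and 
branch are the added cost being debated:

    #include <cstddef>

    // Hypothetical sketch: a bump-pointer TLAB with an extra per-thread
    // countdown to the next sample.
    struct Tlab {
      char* top;                 // current allocation pointer
      char* end;                 // end of the TLAB
      long  bytes_until_sample;  // countdown to the next sample
    };

    // Returns the new object, or nullptr to force the slow path.
    inline char* tlab_allocate(Tlab* t, std::size_t size) {
      if (t->top + size > t->end) return nullptr;      // TLAB full: slow path
      t->bytes_until_sample -= (long) size;            // the extra subtraction
      if (t->bytes_until_sample <= 0) return nullptr;  // sample in slow path
      char* obj = t->top;
      t->top += size;                                  // the usual pointer bump
      return obj;
    }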



Note that HotSpot has been doing an additional addition (to keep track of 
per-thread allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of the total size 
allocated per thread or the number of objects allocated per thread? If it’s 
the former, why isn’t it possible to calculate that from the TLAB information?



and no one has complained.

I'm not worried about the ease of implementation here, because we've already 
implemented it.  


Yeah, but someone will have to maintain it moving forward.



It hasn't even been hard for us to do the forward port, except when the 
relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 
2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned 
off (i.e., you have the tax without any benefit).



In fact, we could just have the sampling code do nothing, and no one would ever 
notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made 
it more expensive to grab a stack trace (the cost became proportional to the 
number of loaded classes), but we have a patch that mitigates that, which we 
would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism 
is that there is quite a lot you can't do from user code during an allocation, 
because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them 
using JNI from the slow path you call directly from the allocation code?



However, you can do pretty much anything from the VM itself. Crucially (for 
us), we don't just log the stack traces; we also keep track of which are live 
and which aren't. We can't do this in a callback if the callback can't create 
weak refs to the object.
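
(To make the liveness point concrete: this boils down to holding a weak 
reference per sampled object and probing it later. A sketch, assuming a valid 
JNIEnv, which is exactly what an allocation-time callback may not have:)

    #include <jni.h>

    // Sketch only: test whether a previously sampled object is still live
    // via a JNI weak reference.
    bool is_sample_live(JNIEnv* env, jweak ref) {
      jobject strong = env->NewLocalRef(ref);  // returns NULL once collected
      if (strong == NULL) return false;
      env->DeleteLocalRef(strong);
      return true;
    }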

What we do at Google is to have two methods: one that you pass a callback to 
(the callback gets invoked with a StackTraceData object, as I've defined 
above), and another that just tells you which sampled objects are still live.  
We could also add a third, which allowed a callback to set the sampling 
interval (basically, the VM would call it to get the integer number of bytes to 
be allocated before the next sample).  
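
(Roughly, the shape of such an interface might be as follows; these are 
invented signatures, just for illustration, not the actual patch:)

    #include <jni.h>

    struct StackTraceData;  // the sampled trace, as defined earlier

    // 1) a callback the VM invokes (internally, not via JNI) per sample
    typedef void (*SampledAllocationCallback)(StackTraceData* sample);
    void SetSampledAllocationCallback(SampledAllocationCallback cb);

    // 2) query which previously sampled objects are still live
    jint GetLiveSampledObjects(StackTraceData** buf, jint buf_len);

    // 3) optionally, let user code pick the sampling interval: the VM
    //    calls this to get the number of bytes to allocate before the
    //    next sample
    typedef jlong (*SamplingIntervalCallback)();
    void SetSamplingIntervalCallback(SamplingIntervalCallback cb);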

Would people be amenable to that?  It makes the code more complex, but, as I 
say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object 
come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and 
you could have identified it by instrumenting the “outside-of-TLAB allocation 
path” (just saying…).

But, seriously, why didn’t you like my proposal? It can do anything your 
scheme can, with fewer and simpler code changes. The only thing it cannot do 
is sample based on object count (e.g., every 100 objects) instead of object 
size (e.g., every 1MB of allocations). But I think sampling based on size is 
the right approach here (IMHO).

Tony




Jeremy


On Tue, Jun 23, 2015 at 1:06 PM, Tony Printezis <tprinte...@twitter.com> wrote:
Jeremy (and all),

I’m not on the serviceability list so I won’t include the messages so far. :-) 
Also CCing the hotspot GC list, in case they have some feedback on this.

Could I suggest a (much) simpler but at least as powerful and flexible way to 
do this? (This is something we’ve been meaning to do for a while now for 
TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force 
allocations to go into the slow path periodically by artificially setting the 
TLAB’s end (the limit the fast path checks against) to a lower value. So, 
imagine a TLAB is 4MB. You can set the end to (bottom+1MB). When an allocation 
thinks the TLAB is full (in this case, when the first 1MB is used up), it will 
take the allocation slow path. There, you can intercept it, sample the 
allocation (and, as in your case, you’ll also have the correct stack trace), 
notice that the TLAB is not actually full, extend its end to, say, 
(bottom+2MB), and you’re done.
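
In (hypothetical) code, the slow-path hook would look something like this; 
made-up names, just to illustrate the idea:

    #include <cstddef>

    // The TLAB keeps its real limit in actual_end and publishes a lower,
    // artificial end to the fast path.
    struct Tlab {
      char* bottom;      // start of the TLAB
      char* top;         // bump pointer
      char* end;         // what the fast path checks against (the watermark)
      char* actual_end;  // the TLAB's real limit
    };

    void  take_sample(char* obj, std::size_t size);  // record stack trace etc.
    char* tlab_allocate(Tlab* t, std::size_t size);  // the inlined fast path

    // Called when the fast path thinks the TLAB is full.
    char* tlab_slow_path(Tlab* t, std::size_t size, std::size_t interval) {
      if (t->end < t->actual_end) {
        // Not really full: we hit the artificial watermark, so sample.
        take_sample(t->top, size);
        // Move the watermark forward by one sampling interval (capped).
        char* new_end = t->end + interval;
        t->end = (new_end < t->actual_end) ? new_end : t->actual_end;
        return tlab_allocate(t, size);  // retry the fast path
      }
      return nullptr;  // genuinely full: do the normal slow-path work
    }

Note that the fast path already compares against the end, so when sampling is 
off (end == actual_end) nothing changes there.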

Advantages of this approach:

* This is a much smaller, simpler, and self-contained change (no compiler 
changes to maintain...).

* When it’s off, the overhead is only one extra test in the slow-path TLAB 
allocation code (i.e., negligible; we do some sampling on TLABs in TwitterJDK 
using a similar mechanism and, when it’s off, I’ve observed no performance 
overhead).

* (most importantly) You can turn this on and off, and adjust the sampling 
rate, dynamically. If you do the sampling based on JITed code, you’ll have to 
recompile all methods with allocation sites to turn the sampling on or off. 
(You can of course have it always on and just discard the output; it’d be nice 
not to have to do that though. IMHO, at least.)

* You can also very cheaply turn this on and off (or adjust the sampling 
frequency) per thread, if that’d be helpful in some way (just add the 
appropriate info to the thread’s TLAB).

A few extra comments on the previous discussion:

* "JFR samples per new TLAB allocation. It provides really very good picture 
and I haven't seen overhead more than 2” : When TLABs get very large, I don’t 
think sampling one object per TLAB is enough to get a good sample (IMHO, at 
least). It’s probably OK for something like jbb which mostly allocates 
instances of a handful of classes and has very few allocation sites. But, a lot 
of the code we run at Twitter is a lot more elaborate than that and, in our 
experience, sampling one object per TLAB is not enough. You can, of course, 
decrease the TLAB size to increase the sampling size. But it’d be good not to 
have to do that given a smaller TLAB size could increase contention across 
threads.

* "Should it *just* take a stack trace, or should the behavior be 
configurable?” : I think we’d have to separate the allocation sampling 
mechanism from the consumption of the allocation samples. Once the sampling 
mechanism is in, different JVMs can take advantage of it in different ways. I 
assume that the Oracle folks would like at least a JFR event for every such 
sample. But in your build you can add extra code to collect the information in 
the way you have now.

* Talking of JFR, it’s a bit unfortunate that the AllocObjectInNewTLAB event 
has both the new TLAB information and the allocation information. It would have 
been nice if that event was split into two, say NewTLAB and AllocObjectInTLAB, 
and we’d be able to fire the latter for each sample.

* "Should the interval between samples be configurable?” : Totally. In fact, 
it’d be helpful if it was configurable dynamically. Imagine if a JVM starts 
misbehaving after 2-3 weeks of running. You can dynamically increase the 
sampling rate to get a better profile if the default is not giving fine-grain 
enough information.

* "As long of these features don’t contribute to sampling bias” : If the 
sampling interval is fixed, sampling bias would be a very real concern. In the 
above example, I’d increment top by 1M (the sampling frequency) + p% (a fudge 
factor). 

* "Yes, a perhaps optional callbacks would be nice too.” : Oh, no. :-) But, as 
I said, we should definitely separate the sampling mechanism from the mechanism 
that consumes the samples.

* "Another problem with our submitting things is that we can't really test on 
anything other than Linux.” : Another reason to go with a as platform 
independent solution as possible. :-)
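
(For the sampling-bias point above, a sketch of how the next watermark 
increment could be randomized; names are made up:)

    #include <cstdlib>

    // The next increment is the base interval plus a random fudge of up
    // to p percent, so sample points don't stay phase-locked with the
    // application's allocation pattern.
    std::size_t next_sample_interval(std::size_t base_bytes, unsigned p) {
      std::size_t max_fudge = base_bytes * p / 100;
      return base_bytes +
             (max_fudge ? (std::size_t) std::rand() % max_fudge : 0);
    }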

Regards,

Tony

-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprinte...@twitter.com








