Jeremy,

Did I follow this correctly - that your approach modifies the compilers and 
interpreters and Tony's modifies the
common allocation code?

Given that the number of compilers, interpreters, and interpreter platforms 
keeps expanding, I'd like to add my vote for having heap allocation profiling 
in the common allocation code.

thanks,
Karen

On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:

> Hi Jeremy,
> 
> Inline.
> 
> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com) wrote:
> 
>> 
>> 
>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <tprinte...@twitter.com> 
>> wrote:
>> Hi Jeremy,
>> 
>> Please see inline.
>> 
>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) 
>> wrote:
>> 
>>> I don't want the size of the TLAB, which is ergonomically adjusted, to be 
>>> tied to the sampling rate.  There is no reason to do that.  I want 
>>> reasonable statistical sampling of the allocations.  
>> 
>> 
>> As I said explicitly in my e-mail, I totally agree with this, which is why I 
>> never suggested resizing TLABs in order to vary the sampling rate. 
>> (Apologies if my e-mail was not clear.)
>> 
>> 
>> My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs 
>> entirely
> 
> 
> This is correct: We’ll also have to intercept the outside-TLAB allocs. But, 
> IMHO, this is a feature, as it’s helpful to know how many (and which) 
> allocations happen outside TLABs. These are generally very infrequent (and 
> slow anyway), so sampling all of them, instead of only some, does not add 
> much overhead. But you could also sample the outside-TLAB allocs, if you 
> want: just accumulate their size on a separate per-thread counter and sample 
> the allocation that bumps that counter over the limit.
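> 
> As a sketch (the counter field and the sample_allocation() helper below are 
> made up for illustration; they don’t exist in HotSpot today):
> 
>   // Hypothetical hook on the outside-TLAB allocation path.
>   void after_outside_tlab_alloc(Thread* t, HeapWord* obj, size_t bytes) {
>     t->_outside_tlab_bytes_since_sample += bytes;
>     if (t->_outside_tlab_bytes_since_sample >= OutsideTLABSampleIntervalBytes) {
>       t->_outside_tlab_bytes_since_sample = 0;
>       sample_allocation(t, obj, bytes);  // record stack trace, size, etc.
>     }
>   }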
> 
> An additional observation (orthogonal to the main point, but I thought I’d 
> mention it anyway): for the outside-TLAB allocs it’d also be helpful to know 
> which generation the object ended up in (e.g., young gen or 
> direct-to-old-gen). This is very useful when you’re trying to work out which 
> allocation(s) grew the old gen occupancy between two young GCs.
> 
> FWIW, the existing JFR events follow the approach I described above:
> 
> * one event for each new TLAB + first alloc in that TLAB (my proposal 
> basically generalizes this and removes the 1-1 relationship between object 
> alloc sampling and new TLAB operation)
> 
> * one event for all allocs outside a TLAB
> 
> I think the above separation is helpful. But if you think it could confuse 
> users, you can of course easily just combine the information (though I 
> strongly believe it’s better to report it separately).
> 
> 
> 
>> (and, in fact, not work if TLAB support is turned off)? 
> 
> 
> Who turns off TLABs? Is -XX:-UseTLAB even tested by Oracle? (This is a 
> genuine question.)
> 
> 
> 
>>  I might be missing something obvious (and see my response below).
> 
> 
>>  
>>> All this requires is a separate counter that is set to the next sampling 
>>> interval, and decremented when an allocation happens, which goes into a 
>>> slow path when the decrement hits 0.  Doing a subtraction and a pointer 
>>> bump in allocation instead of just a pointer bump is basically free.  
>> 
>> 
>> Maybe on Intel it’s cheap, but maybe it’s not on other platforms that other 
>> folks care about.
>> 
>> Really?  A memory read and a subtraction?  Which architectures care about 
>> that?
> 
> 
> I was not concerned with the read and subtraction; I was more concerned with 
> the conditional that follows them (Intel has great branch prediction).
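> 
> For concreteness, the delta on the fast path would be roughly the following 
> (pseudo-C++ of the inlined allocation sequence; the names are illustrative, 
> not real HotSpot code):
> 
>   // Existing TLAB fast path: bump the top pointer.
>   HeapWord* obj = tlab_top;
>   tlab_top = obj + size;
> 
>   // Extra work for sampling: a load, a subtract, and (the part I'm
>   // wary of) a conditional branch.
>   bytes_until_sample -= size_in_bytes;
>   if (bytes_until_sample <= 0) {
>     take_sample(obj, size_in_bytes);  // out-of-line slow path; resets counter
>   }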
> 
> And a personal pet peeve (based on past experience): How many “free” 
> instructions do you have to add before they are not free any more?
> 
> 
> 
>> 
>> Again, notice that no one has complained about the addition that was added 
>> for total bytes allocated per thread.  I note that was actually added in the 
>> 6u20 timeframe.
>> 
>>> Note that it has been doing an additional addition (to keep track of per 
>>> thread allocation) as part of allocation since Java 7, 
>> 
>> 
>> Interesting. I hadn’t realized that. Does that keep track of total size 
>> allocated per thread or number of allocated objects per thread? If it’s the 
>> former, why isn’t it possible to calculate that from the TLABs information?
>> 
>> 
>> Total size allocated per thread.  It isn't possible to calculate that from 
>> the TLAB because of out-of-TLAB allocation 
> 
> 
> The allocating Thread is passed to the slow (outside-TLAB) alloc path so it 
> would be trivial to update the per-thread allocation stats from there too (in 
> fact, it does; see below).
> 
> 
> 
>> (and hypothetically disabled TLABs).
> 
> 
> Does anyone care? :-)
> 
> 
> 
>> 
>> For some reason, they never included it in the ThreadMXBean interface, but 
>> it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean 
>> to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on 
>> it.
>> 
> 
> Thanks for the tip. I’ll look into this...
> 
>>> and no one has complained.
>>> 
>>> I'm not worried about the ease of implementation here, because we've 
>>> already implemented it.  
>> 
>> 
>> Yeah, but someone will have to maintain it moving forward.
>> 
>> 
>> I've been maintaining it internally to Google for 5 years.  It's actually 
>> pretty self-contained.  The only work involved is when they refactor 
>> something (so I've had to move it), or when a bug in the existing 
>> implementation is discovered.  It is very closely parallel to the TLAB code, 
>> which doesn't change much / at all.
>> 
> 
> The TLAB code has really not changed much for a while. ;-) (But I haven’t 
> looked at the JDK 9 source very closely…)
> 
>>> It hasn't even been hard for us to do the forward port, except when the 
>>> relevant Hotspot code is significantly refactored.
>>> 
>>> We can also turn the sampling off, if we want.  We can set the sampling 
>>> rate to 2^32, have the sampling code do nothing, and no one will ever 
>>> notice.  
>> 
>> 
>> You still have extra instructions in the allocation path, so it’s not turned 
>> off (i.e., you have the tax without any benefit).
>> 
>> 
>> Hey, you have a counter in your allocation path you've never noticed, which 
>> none of your code uses.  Pipelining is a wonderful thing.  :)
> 
> 
> See above re: “free” instructions.
> 
> 
> 
>>> 
>>> In fact, we could just have the sampling code do nothing, and no one would 
>>> ever notice.
>>> 
>>> Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 
>>> made it more expensive to grab a stack trace (the cost became proportional 
>>> to the number of loaded classes), but we have a patch that mitigates that, 
>>> which we would also be happy to upstream.
>>> 
>>> As for the other concern: my concern about *just* having the callback 
>>> mechanism is that there is quite a lot you can't do from user code during 
>>> an allocation, because of lack of access to JNI.
>> 
>> 
>> Maybe I missed something. Are the callbacks in Java? I.e., do you call them 
>> using JNI from the slow path you call directly from the allocation code?
>> 
>> (For context: this referred to the hypothetical feature where we can provide 
>> a callback that invokes some code from allocation.)
>> 
>> (It's not actually hypothetical, because we've already implemented it, but 
>> let's call it hypothetical for the moment.)
> 
> 
> OK.
> 
> 
> 
>> 
>> We invoke native code.  You can't invoke any Java code during allocation, 
>> including calling JNI methods, because that would make allocation 
>> potentially reentrant, which doesn't work for all sorts of reasons.
> 
> 
> That’s what I was worried about….
> 
> 
> 
>>   The native code doesn't even get passed a JNIEnv * - there is nothing it 
>> can do with it without making the VM crash a lot.
> 
> 
> So, thanks for the clarification. Being able to attach a callback to this in, 
> say, the JVM would be totally fine. I was worried that you wanted to call 
> Java. :-)
> 
> 
> 
>> 
>> Or, rather, you might be able to do that, but it would take a lot of Hotspot 
>> rearchitecting.  When I tried to do it, I realized it would be an extremely 
>> deep dive.
> 
> 
> I believe you. :-)
> 
> 
> 
>>> 
>>>   However, you can do pretty much anything from the VM itself.  Crucially 
>>> (for us), we don't just log the stack traces, we also keep track of which 
>>> are live and which aren't.  We can't do this in a callback, if the callback 
>>> can't create weak refs to the object.
>>> 
>>> What we do at Google is to have two methods: one that you pass a callback 
>>> to (the callback gets invoked with a StackTraceData object, as I've defined 
>>> above), and another that just tells you which sampled objects are still 
>>> live.  We could also add a third, which allowed a callback to set the 
>>> sampling interval (basically, the VM would call it to get the integer 
>>> number of bytes to be allocated before the next sample).  
>>> 
>>> Would people be amenable to that?  It makes the code more complex, but, as 
>>> I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB 
>>> object come from?").
>> 
>> 
>> Well, that 1GB object would have most likely been allocated outside a TLAB 
>> and you could have identified it by instrumenting the “outside-of-TLAB 
>> allocation path” (just saying…).
>> 
>> 
>> That's orthogonal to the point I was making in the quote above - the point I 
>> was making there was that we want to be able to detect what sampled objects 
>> are live.  We can do that regardless of how we implement the sampling 
>> (although it did involve my making a new kind of weak oop processing 
>> mechanism inside the VM).
> 
> 
> Yeah, I was thinking of doing something similar (tracking object lifetimes, 
> and other attributes, with WeakRefs). 
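> 
> BTW, just so we’re on the same page, here is a rough sketch of the kind of 
> native interface you’re describing (every name below is hypothetical; this 
> is not your actual API):
> 
>   #include <jni.h>  // only for jlong/jint; the callbacks are native-only
> 
>   struct StackTraceData;  // opaque: stack trace plus object info
> 
>   // Invoked from inside the VM for each sampled allocation; no JNIEnv*,
>   // so it cannot call back into Java.
>   typedef void (*SampledAllocationCallback)(const StackTraceData* data);
> 
>   // Returns the number of bytes to allocate before the next sample.
>   typedef jlong (*SamplingIntervalCallback)(void);
> 
>   void RegisterSampledAllocationCallback(SampledAllocationCallback cb);
>   void RegisterSamplingIntervalCallback(SamplingIntervalCallback cb);
> 
>   // Reports which sampled objects are still live (fills 'out' up to 'max').
>   jint GetLiveSampledObjects(const StackTraceData** out, jint max);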
> 
> 
> 
>> 
>> But to the question of whether we can just instrument the outside-of-tlab 
>> allocation path...  There are a few weirdnesses here.  The first one that 
>> jumps to mind is that there's also a fast path for allocating in the YG 
>> outside of TLABs, if an object is too large to fit in the current TLAB.  
>> Those objects would never get sampled.  So "outside of tlab" doesn't always 
>> mean "slow path".
> 
> 
> CollectedHeap::common_mem_allocate_noinit() is the first level of the slow 
> path, called when a TLAB allocation fails because the object doesn’t fit in 
> the current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) 
> whether to refill the current TLAB or to keep the TLAB and delegate to the GC 
> (mem_allocate()) to allocate the object outside a TLAB (either in the young 
> or old gen; the GC might also decide to do a collection at this point if, 
> say, the eden is full...). So, it depends on what you mean by slow path, but, 
> yes, any allocations that go through the above path should be considered 
> “slow path” allocations.
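> 
> In (heavily simplified) pseudo-C++, paraphrasing the JDK 8 code from memory 
> rather than quoting it verbatim:
> 
>   HeapWord* CollectedHeap::common_mem_allocate_noinit(KlassHandle klass,
>                                                        size_t size, TRAPS) {
>     HeapWord* obj = NULL;
>     if (UseTLAB) {
>       // Try the TLAB path, possibly refilling the current TLAB
>       // (allocate_from_tlab() -> allocate_from_tlab_slow()).
>       obj = allocate_from_tlab(klass, THREAD, size);
>       if (obj != NULL) return obj;
>     }
>     // Keep the TLAB and delegate to the GC: allocate outside a TLAB
>     // (young or old gen; may trigger a collection, e.g. if the eden is full).
>     bool gc_overhead_limit_was_exceeded = false;
>     obj = Universe::heap()->mem_allocate(size, &gc_overhead_limit_was_exceeded);
>     THREAD->incr_allocated_bytes(size * HeapWordSize);   // per-thread stats
>     AllocTracer::send_allocation_outside_tlab_event(klass, size * HeapWordSize);
>     return obj;
>   }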
> 
> One more piece of data: AllocTracer::send_allocation_outside_tlab_event() 
> (the JFR entry point for outside-TLAB allocs) is fired from 
> common_mem_allocate_noinit(). So, if there are other non-TLAB allocation 
> paths outside that method, that entry point has been placed incorrectly (it’s 
> possible, of course; but I think it’s actually placed correctly).
> 
> (note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources 
> yet, the above might have been changed)
> 
> BTW, when looking at the common_mem_allocate_noinit() code I noticed the 
> following:
> 
> THREAD->incr_allocated_bytes(size * HeapWordSize);
> (as predicted earlier)
> 
> 
> 
>> 
>> Another one that jumps to mind is that we don't know whether the 
>> outside-of-TLAB path actually passes the sampling threshold, especially if 
>> we let users configure the sampling threshold.  So how would we know whether 
>> to sample it?
> 
> 
> See above (IMHO: sample all of them).
> 
> 
> 
>> 
>> You also have to keep track of the sampling interval in the code where we 
>> allocate new TLABs, in case the sampling threshold is larger than the TLAB 
>> size.  That's not a big deal, of course.
> 
> 
> Of course, but that’s kinda trivial. BTW, one approach here would be “given 
> that refilling a TLAB is slow anyway, always sample the first object in each 
> TLAB irrespective of the desired sampling frequency”. Another would be “don’t 
> do that; I set the sampling frequency pretty low so as not to be flooded with 
> data when the TLABs are very small”. I have to say I’m in the latter camp.
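> 
> In other words (the latter camp), carry the countdown across refills rather 
> than resetting it. A sketch, with invented field names:
> 
>   // Hypothetical refill hook: the bytes-until-sample countdown survives
>   // the TLAB refill, so small TLABs don't inflate the sampling rate.
>   void on_tlab_refill(Thread* t, size_t tlab_size_bytes) {
>     if (t->_bytes_until_sample > tlab_size_bytes) {
>       // No sample due within this TLAB; consume its size from the
>       // countdown (approximating that the TLAB will be fully used).
>       t->_bytes_until_sample -= tlab_size_bytes;
>     } else {
>       // The allocation that crosses the threshold should be sampled.
>       t->_sample_offset_in_tlab = t->_bytes_until_sample;
>     }
>   }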
> 
> 
> 
>> 
>> 
>> And, every time the TLAB code changes, we have to consider whether / how 
>> those changes affect this sampling mechanism.
> 
> 
> Yes, but how often does the TLAB code change? :-)
> 
> 
> 
>> 
>> I guess my larger point is that there are so many little corner cases with 
>> TLAB allocation, including whether it even happens, that basing the sampling 
>> strategy around it seems like a cop-out.  
> 
> 
> There are not many little corner cases. There are two cases: allocation 
> inside a TLAB, allocation outside a TLAB. The former is by far the most 
> common. The latter is generally very infrequent and has a well-defined code 
> path (I described it earlier). And, as I said, it could be very helpful and 
> informative to treat (and account for) the two cases separately.
> 
> 
> 
>> And my belief is that the arguments against our strategy don't really hold 
>> water, especially given the presence of the per-thread allocation counter 
>> that no one noticed.  
> 
> 
> I’ve already addressed that.
> 
> 
> 
>> 
>> Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck 
>> Rasbold).  All we really need is to write an acceptable JEP, to adjust the 
>> code based on the changes the community wants, and someone from Oracle 
>> willing to say "yes".
> 
> 
>> 
>> For reference, to keep track of sampling, the delta to C2 is about 150 LOC 
>> (much of which is newlines-because-of-formatting for methods that take a lot 
>> of parameters), the delta to C1 is about 60 LOC, the delta to each x86 
>> template interpreter is about 20 LOC, and the delta for the assembler is 
>> about 40 LOC.  It's not completely trivial, but the code hasn't changed 
>> substantially in the 5 years since I wrote it (other than a couple of 
>> bugfixes).
>> 
>> Obviously, assembler/template interpreter would have to be dup'd across 
>> platforms - we can do that for PPC and aarch64, on which we do active 
>> development, at least.
> 
> 
> I’ll again vote for the simplicity of having a simple change in only one 
> place (OK, two places…).
> 
> 
> 
>> 
>> But, seriously, why didn’t you like my proposal? It can do anything your 
>> scheme can with fewer and simpler code changes. The only thing it cannot do 
>> is sample based on object count (i.e., every 100 objects) instead of object 
>> size (i.e., every 1MB of allocations). But IMHO sampling based on size is 
>> the right approach here.
>> 
>> I agree that sampling based on size is the right approach.  
>> 
>> (And your approach is definitely simpler - I don't mean to discount it.  And 
>> if that's what it takes to get this feature accepted, we'll do it, but I'll 
>> grumble about it.)
> 
> 
> That’s fine. :-)
> 
> Tony
> 
> 
> 
> 
> -----
> 
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
> 
> @TonyPrintezis
> tprinte...@twitter.com
> 
> 
