Jeremy,

Did I follow this correctly - that your approach modifies the compilers and interpreters, and Tony's modifies the common allocation code?
Given that the number of compilers, interpreters, and interpreter platforms keeps expanding, I'd like to add a vote for having heap allocation profiling in the common allocation code.

Thanks,
Karen

On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:

> Hi Jeremy,
>
> Inline.
>
> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com) wrote:
>
>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <tprinte...@twitter.com> wrote:
>> Hi Jeremy,
>>
>> Please see inline.
>>
>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) wrote:
>>
>>> I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate. There is no reason to do that. I want reasonable statistical sampling of the allocations.
>>
>> As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested resizing TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)
>>
>> My fault - I misread it. Doesn't your proposal miss out-of-TLAB allocs entirely
>
> This is correct: we'll also have to intercept the outside-TLAB allocs. But, IMHO, this is a feature, as it's helpful to know how many (and which) allocations happen outside TLABs. These are generally very infrequent (and slow anyway), so sampling all of them, instead of only some of them, does not add much overhead. But you could also sample the outside-TLAB allocs if you want: just accumulate their size on a separate per-thread counter and take a sample when that counter goes over a limit.
>
> An additional observation (orthogonal to the main point, but I thought I'd mention it anyway): for the outside-TLAB allocs it'd be helpful to also know which generation the object ended up in (e.g., young gen or direct-to-old-gen). This is very helpful in some situations when you're trying to work out which allocation(s) grew the old gen occupancy between two young GCs.
>
> FWIW, the existing JFR events follow the approach I described above:
>
> * one event for each new TLAB + first alloc in that TLAB (my proposal basically generalizes this and removes the 1-1 relationship between object alloc sampling and the new-TLAB operation)
>
> * one event for all allocs outside a TLAB
>
> I think the above separation is helpful. But if you think it could confuse users, you can of course easily just combine the information (but I strongly believe it's better to report the information separately).
>
>> (and, in fact, not work if TLAB support is turned off)?
>
> Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine question.)
>
>> I might be missing something obvious (and see my response below).
>
>>> All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0. Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.
>>
>> Maybe on Intel it's cheap, but maybe it's not on other platforms that other folks care about.
>>
>> Really? A memory read and a subtraction? Which architectures care about that?
>
> I was not concerned with the read and subtraction; I was more concerned with the conditional that follows them (Intel has great branch prediction).
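To make the scheme concrete: the per-allocation check being debated amounts to roughly the following (a minimal Java sketch of the logic only, assuming a 512 KB interval; the real implementation lives in HotSpot's interpreter/C1/C2 allocation paths, and all names here are illustrative):

    // Illustrative per-thread sampling state; the interval and names are made up.
    final class AllocationSampler {
        static final long SAMPLING_INTERVAL_BYTES = 512 * 1024;

        private long bytesUntilSample = SAMPLING_INTERVAL_BYTES;

        // Conceptually runs on every allocation of 'size' bytes: one read,
        // one subtraction, and the conditional discussed above.
        void onAllocate(Object obj, long size) {
            bytesUntilSample -= size;
            if (bytesUntilSample <= 0) {
                sample(obj, size);  // slow path: record a stack trace, etc.
                bytesUntilSample = SAMPLING_INTERVAL_BYTES;
            }
        }

        private void sample(Object obj, long size) {
            // Recording of the sampled allocation would go here.
        }
    }

The outside-TLAB variant Tony suggests is the same idea with a second per-thread counter that only outside-TLAB allocations bump.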
> And a personal pet peeve (based on past experience): how many "free" instructions do you have to add before they are not free any more?
>
>> Again, notice that no one has complained about the addition that was added for total bytes allocated per thread. I note that was actually added in the 6u20 timeframe.
>>
>>> Note that it has been doing an additional addition (to keep track of per-thread allocation) as part of allocation since Java 7,
>>
>> Interesting. I hadn't realized that. Does that keep track of total size allocated per thread or the number of allocated objects per thread? If it's the former, why isn't it possible to calculate that from the TLABs information?
>>
>> Total size allocated per thread. It isn't possible to calculate that from the TLAB because of out-of-TLAB allocation
>
> The allocating Thread is passed to the slow (outside-TLAB) alloc path, so it would be trivial to update the per-thread allocation stats from there too (in fact, it does; see below).
>
>> (and hypothetically disabled TLABs).
>
> Does anyone care? :-)
>
>> For some reason, they never included it in the ThreadMXBean interface, but it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.
>
> Thanks for the tip. I'll look into this...
>
>>> and no one has complained.
>>>
>>> I'm not worried about the ease of implementation here, because we've already implemented it.
>>
>> Yeah, but someone will have to maintain it moving forward.
>>
>> I've been maintaining it internally to Google for 5 years. It's actually pretty self-contained. The only work involved is when they refactor something (so I've had to move it), or when a bug in the existing implementation is discovered. It is very closely parallel to the TLAB code, which doesn't change much / at all.
>
> The TLAB code has really not changed much for a while. ;-) (Though I haven't looked at the JDK 9 source very closely…)
>
>>> It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.
>>>
>>> We can also turn the sampling off, if we want. We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.
>>
>> You still have extra instructions in the allocation path, so it's not turned off (i.e., you have the tax without any benefit).
>>
>> Hey, you have a counter in your allocation path you've never noticed, which none of your code uses. Pipelining is a wonderful thing. :)
>
> See above re: "free" instructions.
>
>>> In fact, we could just have the sampling code do nothing, and no one would ever notice.
>>>
>>> Honestly, no one ever notices the overhead of the sampling, anyway. JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.
>>>
>>> As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.
>>
>> Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?
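As an aside, the com.sun.management.ThreadMXBean tip above can be exercised like this (the cast and the two methods shown are real API on HotSpot-based JDKs; the demo class itself is just an illustration):

    import java.lang.management.ManagementFactory;

    public class AllocatedBytesDemo {
        public static void main(String[] args) {
            // The platform ThreadMXBean is castable to the com.sun.management one.
            com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
            if (bean.isThreadAllocatedMemorySupported()) {
                long tid = Thread.currentThread().getId();
                long before = bean.getThreadAllocatedBytes(tid);
                byte[] data = new byte[1024 * 1024];  // allocate roughly 1 MB
                long after = bean.getThreadAllocatedBytes(tid);
                System.out.println(data.length + " bytes requested, ~"
                        + (after - before) + " bytes charged to this thread");
            }
        }
    }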
>> (For context: this referred to the hypothetical feature where we can provide a callback that invokes some code from allocation.)
>>
>> (It's not actually hypothetical, because we've already implemented it, but let's call it hypothetical for the moment.)
>
> OK.
>
>> We invoke native code. You can't invoke any Java code during allocation, including calling JNI methods, because that would make allocation potentially reentrant, which doesn't work for all sorts of reasons.
>
> That's what I was worried about…
>
>> The native code doesn't even get passed a JNIEnv * - there is nothing it can do with it without making the VM crash a lot.
>
> So, thanks for the clarification. Being able to attach a callback to this in, say, the JVM would be totally fine. I was worried that you wanted to call Java. :-)
>
>> Or, rather, you might be able to do that, but it would take a lot of Hotspot rearchitecting. When I tried to do it, I realized it would be an extremely deep dive.
>
> I believe you. :-)
>
>>> However, you can do pretty much anything from the VM itself. Crucially (for us), we don't just log the stack traces; we also keep track of which are live and which aren't. We can't do this in a callback, if the callback can't create weak refs to the object.
>>>
>>> What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live. We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).
>>>
>>> Would people be amenable to that? It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey! Where did that 1 GB object come from?").
>>
>> Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the "outside-of-TLAB allocation path" (just saying…).
>>
>> That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live. We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).
>
> Yeah, I was thinking of doing something similar (tracking object lifetimes, and other attributes, with WeakRefs).
>
>> But to the question of whether we can just instrument the outside-of-TLAB allocation path... There are a few weirdnesses here. The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB. Those objects would never get sampled. So "outside of TLAB" doesn't always mean "slow path".
>
> CollectedHeap::common_mem_allocate_noinit() is the first level of the slow path, called when a TLAB allocation fails because the object doesn't fit in the current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) whether to refill the current TLAB, or to keep the TLAB and delegate to the GC (mem_allocate()) to allocate the object outside a TLAB (either in the young or old gen; the GC might also decide to do a collection at this point if, say, the eden is full...). So, it depends on what you mean by slow path, but, yes, any allocations that go through the above path should be considered "slow path" allocations.
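Schematically, the path Tony walks through looks something like this (a Java-flavored sketch of the C++ control flow with stub bodies; only the comments tie it back to the real HotSpot methods, everything else is a stand-in):

    // Sketch of the HotSpot slow-path flow described above; bodies are stubs.
    final class SlowPathSketch {
        // Stands in for allocate_from_tlab() / allocate_from_tlab_slow():
        // may refill the current TLAB; returns null if the object should be
        // placed outside a TLAB instead. Stubbed to always fail here.
        static Object allocateFromTlab(long size) {
            return null;
        }

        // Stands in for the GC's mem_allocate(): allocates outside any TLAB
        // (young or old gen) and may decide to collect first. This is also
        // where the outside-TLAB JFR event fires.
        static Object memAllocateOutsideTlab(long size) {
            return new byte[(int) Math.min(size, Integer.MAX_VALUE)];
        }

        // Stands in for CollectedHeap::common_mem_allocate_noinit().
        static Object commonMemAllocateNoinit(long size) {
            Object obj = allocateFromTlab(size);
            if (obj != null) {
                return obj;  // satisfied from a (refilled) TLAB
            }
            return memAllocateOutsideTlab(size);
        }
    }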
> One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the JFR entry point for outside-TLAB allocs) is fired from common_mem_allocate_noinit(). So, if there are other non-TLAB allocation paths outside that method, that entry point has been placed incorrectly (it's possible, of course, but I think it's actually placed correctly).
>
> (Note: I've only looked at the JDK 8 sources, not the JDK 9 sources yet; the above might have changed.)
>
> BTW, when looking at the common_mem_allocate_noinit() code I noticed the following:
>
> THREAD->incr_allocated_bytes(size * HeapWordSize);
>
> (as predicted earlier)
>
>> Another one that jumps to mind is that we don't know whether the outside-of-TLAB path actually passes the sampling threshold, especially if we let users configure the sampling threshold. So how would we know whether to sample it?
>
> See above (IMHO: sample all of them).
>
>> You also have to keep track of the sampling interval in the code where we allocate new TLABs, in case the sampling threshold is larger than the TLAB size. That's not a big deal, of course.
>
> Of course, but that's kinda trivial. BTW, one approach here would be "given that refilling a TLAB is slow anyway, always sample the first object in each TLAB irrespective of the desired sampling frequency". Another would be "don't do that; I set the sampling frequency pretty low so as not to be flooded with data when the TLABs are very small". I have to say I'm in the latter camp.
>
>> And, every time the TLAB code changes, we have to consider whether / how those changes affect this sampling mechanism.
>
> Yes, but how often does the TLAB code change? :-)
>
>> I guess my larger point is that there are so many little corner cases with TLAB allocation, including whether it even happens, that basing the sampling strategy around it seems like a cop-out.
>
> There are not many little corner cases. There are two cases: allocation inside a TLAB and allocation outside a TLAB. The former is by far the most common. The latter is generally very infrequent and has a well-defined code path (I described it earlier). And, as I said, it could be very helpful and informative to treat (and account for) the two cases separately.
>
>> And my belief is that the arguments against our strategy don't really hold water, especially given the presence of the per-thread allocation counter that no one noticed.
>
> I've already addressed that.
>
>> Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck Rasbold). All we really need is to write an acceptable JEP, to adjust the code based on the changes the community wants, and someone from Oracle willing to say "yes".
>>
>> For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC. It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).
>>
>> Obviously, the assembler/template interpreter changes would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.
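Picking up the liveness-tracking idea from earlier in the thread (Jeremy's "which sampled objects are still live" and Tony's WeakRefs remark): at the user level the bookkeeping might look like the sketch below. This is illustrative only; the mechanism Jeremy describes is a dedicated weak-oop processing pass inside the VM, and all names here are made up.

    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.List;

    // Tracks sampled objects weakly, so live samples can be told apart from
    // ones the GC has already collected.
    final class SampledObjectTracker {
        private final List<WeakReference<Object>> samples = new ArrayList<>();

        synchronized void record(Object sampled) {
            samples.add(new WeakReference<>(sampled));
        }

        // A sample whose weak reference still resolves is still live.
        synchronized List<Object> liveSamples() {
            List<Object> live = new ArrayList<>();
            for (WeakReference<Object> ref : samples) {
                Object o = ref.get();
                if (o != null) {
                    live.add(o);
                }
            }
            return live;
        }
    }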
> I'll again vote for the simplicity of having a simple change in only one place (OK, two places…).
>
>> But, seriously, why didn't you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing it cannot do is sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).
>>
>> I agree that sampling based on size is the right approach.
>>
>> (And your approach is definitely simpler - I don't mean to discount it. And if that's what it takes to get this feature accepted, we'll do it, but I'll grumble about it.)
>
> That's fine. :-)
>
> Tony
>
> -----
>
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>
> @TonyPrintezis
> tprinte...@twitter.com