Thanks, Jeremy - that helps - not such a big deal in terms of maintenance - so back to the bigger discussion.
thanks,
Karen

On Jun 26, 2015, at 1:34 AM, Jeremy Manson wrote:

> Karen,
>
> I understand your concerns. For reference, this is the additional code in
> the x86 assembler. There are very small stubs in C1 and the template
> interpreter to call out to this macro (forgive the gratuitous use of the
> string "google" - we sprinkle it around a bit because it makes it a little
> easier to distinguish our code from upstream code).
>
> #define GOOGLE_HEAP_MONITORING(ma, thread, var_size_in_bytes, \
>                                con_size_in_bytes, object, t1, t2, \
>                                sample_invocation) \
> do { \
>   { \
>     SkipIfEqual skip_if(ma, &GoogleHeapMonitor, 0); \
>     Label skip_sample; \
>     Register thr = thread; \
>     if (!thr->is_valid()) { \
>       NOT_LP64(assert(t1 != noreg, \
>                       "Need temporary register for constants")); \
>       thr = NOT_LP64(t1) LP64_ONLY(r15_thread); \
>       NOT_LP64(ma->get_thread(thr);) \
>     } \
>     /* Trigger heap monitoring event */ \
>     Address bus(thr, \
>                 JavaThread::google_bytes_until_sample_offset()); \
> \
>     if (var_size_in_bytes->is_valid()) { \
>       ma->NOT_LP64(subl) LP64_ONLY(subq)(bus, var_size_in_bytes); \
>     } else { \
>       int csib = (con_size_in_bytes); \
>       assert(t2 != noreg, \
>              "Need temporary register for constants"); \
>       ma->NOT_LP64(movl) LP64_ONLY(mov64)(t2, csib); \
>       ma->NOT_LP64(subl) LP64_ONLY(subq)(bus, t2); \
>     } \
> \
>     ma->jcc(Assembler::positive, skip_sample); \
> \
>     { \
>       sample_invocation \
>     } \
>     ma->bind(skip_sample); \
>   } \
> } while(0)
>
> It's not all that hard to port to additional architectures, but we'll
> have to think about it.
>
> Jeremy
>
> On Thu, Jun 25, 2015 at 1:41 PM, Karen Kinnear <karen.kinn...@oracle.com>
> wrote:
> Jeremy,
>
> Did I follow this correctly - that your approach modifies the compilers
> and interpreters, and Tony's modifies the common allocation code?
>
> Given that the number of compilers, interpreters, and interpreter
> platforms keeps expanding, I'd like to add a vote to have heap allocation
> profiling in common allocation code.
>
> thanks,
> Karen
>
> On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:
>
>> Hi Jeremy,
>>
>> Inline.
>>
>> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com)
>> wrote:
>>
>>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis
>>> <tprinte...@twitter.com> wrote:
>>> Hi Jeremy,
>>>
>>> Please see inline.
>>>
>>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com)
>>> wrote:
>>>
>>>> I don't want the size of the TLAB, which is ergonomically adjusted,
>>>> to be tied to the sampling rate. There is no reason to do that. I
>>>> want reasonable statistical sampling of the allocations.
>>>
>>> As I said explicitly in my e-mail, I totally agree with this, which is
>>> why I never suggested resizing TLABs in order to vary the sampling
>>> rate. (Apologies if my e-mail was not clear.)
>>>
>>> My fault - I misread it. Doesn't your proposal miss out-of-TLAB allocs
>>> entirely
>>
>> This is correct: we’ll also have to intercept the outside-TLAB allocs.
>> But, IMHO, this is a feature, as it’s helpful to know how many (and
>> which) allocations happen outside TLABs. These are generally very
>> infrequent (and slow anyway), so sampling all of them, instead of only
>> sampling some of them, does not add much overhead. But you could also
>> sample the outside-TLAB allocs, if you want: just accumulate their size
>> on a separate per-thread counter and sample the allocation that bumps
>> that counter over a limit.
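>>
>> (A minimal sketch of that counter scheme, to make it concrete - the
>> field and helper names below are made up for illustration, not existing
>> HotSpot code:)
>>
>>   // Outside-TLAB slow path: accumulate allocation sizes on a
>>   // per-thread counter and take a sample whenever the running total
>>   // crosses the configured limit.
>>   void outside_tlab_alloc_hook(JavaThread* thread, size_t size_in_bytes) {
>>     thread->_outside_tlab_bytes += size_in_bytes;        // hypothetical field
>>     if (thread->_outside_tlab_bytes >= OutsideTLABSampleIntervalBytes) {
>>       thread->_outside_tlab_bytes = 0;
>>       take_outside_tlab_sample(thread, size_in_bytes);   // hypothetical hook
>>     }
>>   }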
>>
>> An additional observation (orthogonal to the main point, but I thought
>> I’d mention it anyway): for the outside-TLAB allocs it’d be helpful to
>> also know which generation the object ended up in (e.g., young gen or
>> direct-to-old-gen). This is very helpful in some situations when you’re
>> trying to work out which allocation(s) grew the old gen occupancy
>> between two young GCs.
>>
>> FWIW, the existing JFR events follow the approach I described above:
>>
>> * one event for each new TLAB + first alloc in that TLAB (my proposal
>> basically generalizes this and removes the 1-1 relationship between
>> object alloc sampling and the new TLAB operation)
>>
>> * one event for all allocs outside a TLAB
>>
>> I think the above separation is helpful. But if you think it could
>> confuse users, you can of course easily just combine the information
>> (but I strongly believe it’s better to report the information
>> separately).
>>
>>> (and, in fact, not work if TLAB support is turned off)?
>>
>> Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a
>> genuine question.)
>>
>>> I might be missing something obvious (and see my response below).
>>
>>>> All this requires is a separate counter that is set to the next
>>>> sampling interval, and decremented when an allocation happens, which
>>>> goes into a slow path when the decrement hits 0 (a C-level sketch of
>>>> this check appears further below). Doing a subtraction and a pointer
>>>> bump in allocation instead of just a pointer bump is basically free.
>>>
>>> Maybe it’s cheap on Intel, but maybe it’s not on other platforms that
>>> other folks care about.
>>>
>>> Really? A memory read and a subtraction? Which architectures care
>>> about that?
>>
>> I was not concerned with the read and the subtraction, I was more
>> concerned with the conditional that follows them (Intel has great
>> branch prediction).
>>
>> And a personal pet peeve (based on past experience): how many “free”
>> instructions do you have to add before they are not free any more?
>>
>>> Again, notice that no one has complained about the addition that was
>>> added for total bytes allocated per thread. I note that was actually
>>> added in the 6u20 timeframe.
>>
>>>> Note that it has been doing an additional addition (to keep track of
>>>> per-thread allocation) as part of allocation since Java 7,
>>>
>>> Interesting. I hadn’t realized that. Does that keep track of total
>>> size allocated per thread or the number of allocated objects per
>>> thread? If it’s the former, why isn’t it possible to calculate that
>>> from the TLABs information?
>>>
>>> Total size allocated per thread. It isn't possible to calculate that
>>> from the TLAB because of out-of-TLAB allocation
>>
>> The allocating Thread is passed to the slow (outside-TLAB) alloc path,
>> so it would be trivial to update the per-thread allocation stats from
>> there too (in fact, it does; see below).
>>
>>> (and hypothetically disabled TLABs).
>>
>> Does anyone care? :-)
>>
>>> For some reason, they never included it in the ThreadMXBean interface,
>>> but it is in com.sun.management.ThreadMXBean, so you can cast your
>>> ThreadMXBean to a com.sun.management.ThreadMXBean and call
>>> getThreadAllocatedBytes() on it.
>>
>> Thanks for the tip. I’ll look into this...
>>
>>>> and no one has complained.
>>>>
>>>> I'm not worried about the ease of implementation here, because we've
>>>> already implemented it.
>>>
>>> Yeah, but someone will have to maintain it moving forward.
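>>>>
>>>> (Concretely, the counter-decrement check described above amounts to
>>>> roughly the following - an illustrative sketch only, not the actual
>>>> patch; the field and helper names are made up:)
>>>>
>>>>   // Per-allocation fast path: one load, one subtract, one
>>>>   // (rarely taken) conditional branch; the sample is the slow path.
>>>>   inline void on_alloc(JavaThread* thread, size_t size_in_bytes) {
>>>>     thread->_bytes_until_sample -= (intptr_t) size_in_bytes;   // hypothetical field
>>>>     if (thread->_bytes_until_sample <= 0) {
>>>>       take_sample(thread, size_in_bytes);           // hypothetical slow path
>>>>       thread->_bytes_until_sample = next_sampling_interval();
>>>>     }
>>>>   }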
>>>
>>> I've been maintaining it internally at Google for 5 years. It's
>>> actually pretty self-contained. The only work involved is when they
>>> refactor something (so I've had to move it), or when a bug in the
>>> existing implementation is discovered. It is very closely parallel to
>>> the TLAB code, which doesn't change much / at all.
>>
>> The TLAB code has really not changed much for a while. ;-) (though I
>> haven’t looked at the JDK 9 source very closely…)
>>
>>>> It hasn't even been hard for us to do the forward port, except when
>>>> the relevant Hotspot code is significantly refactored.
>>>>
>>>> We can also turn the sampling off, if we want. We can set the
>>>> sampling rate to 2^32, have the sampling code do nothing, and no one
>>>> will ever notice.
>>>
>>> You still have extra instructions in the allocation path, so it’s not
>>> turned off (i.e., you have the tax without any benefit).
>>>
>>> Hey, you have a counter in your allocation path you've never noticed,
>>> which none of your code uses. Pipelining is a wonderful thing. :)
>>
>> See above re: “free” instructions.
>>
>>>> In fact, we could just have the sampling code do nothing, and no one
>>>> would ever notice.
>>>>
>>>> Honestly, no one ever notices the overhead of the sampling, anyway.
>>>> JDK8 made it more expensive to grab a stack trace (the cost became
>>>> proportional to the number of loaded classes), but we have a patch
>>>> that mitigates that, which we would also be happy to upstream.
>>>>
>>>> As for the other concern: my concern about *just* having the callback
>>>> mechanism is that there is quite a lot you can't do from user code
>>>> during an allocation, because of lack of access to JNI.
>>>
>>> Maybe I missed something. Are the callbacks in Java? I.e., do you call
>>> them using JNI from the slow path you call directly from the
>>> allocation code?
>>>
>>> (For context: this referred to the hypothetical feature where we can
>>> provide a callback that invokes some code from allocation.)
>>>
>>> (It's not actually hypothetical, because we've already implemented it,
>>> but let's call it hypothetical for the moment.)
>>
>> OK.
>>
>>> We invoke native code. You can't invoke any Java code during
>>> allocation, including calling JNI methods, because that would make
>>> allocation potentially reentrant, which doesn't work for all sorts of
>>> reasons.
>>
>> That’s what I was worried about…
>>
>>> The native code doesn't even get passed a JNIEnv * - there is nothing
>>> it could do with one without making the VM crash a lot.
>>
>> So, thanks for the clarification. Being able to attach a callback to
>> this inside, say, the JVM would be totally fine. I was worried that you
>> wanted to call Java. :-)
>>
>>> Or, rather, you might be able to do that, but it would take a lot of
>>> Hotspot rearchitecting. When I tried to do it, I realized it would be
>>> an extremely deep dive.
>>
>> I believe you. :-)
>>
>>>> However, you can do pretty much anything from the VM itself.
>>>> Crucially (for us), we don't just log the stack traces, we also keep
>>>> track of which are live and which aren't. We can't do this in a
>>>> callback if the callback can't create weak refs to the object.
>>>>
>>>> What we do at Google is to have two methods: one that you pass a
>>>> callback to (the callback gets invoked with a StackTraceData object,
>>>> as I've defined above), and another that just tells you which sampled
>>>> objects are still live.
>>>> We could also add a third, which would allow a callback to set the
>>>> sampling interval (basically, the VM would call it to get the integer
>>>> number of bytes to be allocated before the next sample).
>>>>
>>>> Would people be amenable to that? It makes the code more complex,
>>>> but, as I say, it's nice for detecting memory leaks ("Hey! Where did
>>>> that 1 GB object come from?").
>>>
>>> Well, that 1GB object would have most likely been allocated outside a
>>> TLAB and you could have identified it by instrumenting the
>>> “outside-of-TLAB allocation path” (just saying…).
>>>
>>> That's orthogonal to the point I was making in the quote above, which
>>> was that we want to be able to detect which sampled objects are live.
>>> We can do that regardless of how we implement the sampling (although
>>> it did involve my making a new kind of weak oop processing mechanism
>>> inside the VM).
>>
>> Yeah, I was thinking of doing something similar (tracking object
>> lifetimes, and other attributes, with WeakRefs).
>>
>>> But to the question of whether we can just instrument the
>>> outside-of-tlab allocation path... There are a few weirdnesses here.
>>> The first one that jumps to mind is that there's also a fast path for
>>> allocating in the YG outside of TLABs, if an object is too large to
>>> fit in the current TLAB. Those objects would never get sampled. So
>>> "outside of tlab" doesn't always mean "slow path".
>>
>> CollectedHeap::common_mem_allocate_noinit() is the first level of the
>> slow path, called when a TLAB allocation fails because the object
>> doesn’t fit in the current TLAB. It checks (allocate_from_tlab() /
>> allocate_from_tlab_slow()) whether to refill the current TLAB or keep
>> the TLAB and delegate to the GC (mem_allocate()) to allocate the object
>> outside a TLAB (either in the young or old gen; the GC might also
>> decide to do a collection at this point if, say, the eden is full...).
>> So, it depends on what you mean by slow path but, yes, any allocations
>> that go through the above path should be considered “slow path”
>> allocations.
>>
>> One more piece of data: AllocTracer::send_allocation_outside_tlab_event()
>> (the JFR entry point for outside-TLAB allocs) is fired from
>> common_mem_allocate_noinit(). So, if there are other non-TLAB
>> allocation paths outside that method, that entry point has been placed
>> incorrectly (it’s possible, of course, but I think it’s actually placed
>> correctly).
>>
>> (note: I only looked at the JDK 8 sources, haven’t checked the JDK 9
>> sources yet; the above might have changed)
>>
>> BTW, when looking at the common_mem_allocate_noinit() code I noticed
>> the following:
>>
>>   THREAD->incr_allocated_bytes(size * HeapWordSize);
>>
>> (as predicted earlier)
>>
>>> Another one that jumps to mind is that we don't know whether the
>>> outside-of-TLAB path actually passes the sampling threshold,
>>> especially if we let users configure the sampling threshold. So how
>>> would we know whether to sample it?
>>
>> See above (IMHO: sample all of them).
>>
>>> You also have to keep track of the sampling interval in the code where
>>> we allocate new TLABs, in case the sampling threshold is larger than
>>> the TLAB size. That's not a big deal, of course.
>>
>> Of course, but that’s kinda trivial. BTW, one approach here would be
>> “given that refilling a TLAB is slow anyway, always sample the first
>> object in each TLAB irrespective of the desired sampling frequency”.
>> Another would be “don’t do that, I set the sampling frequency pretty
>> low so as not to be flooded with data when the TLABs are very small”.
>> I have to say I’m in the latter camp.
>>
>>> And, every time the TLAB code changes, we have to consider whether /
>>> how those changes affect this sampling mechanism.
>>
>> Yes, but how often does the TLAB code change? :-)
>>
>>> I guess my larger point is that there are so many little corner cases
>>> with TLAB allocation, including whether it even happens, that basing
>>> the sampling strategy around it seems like a cop-out.
>>
>> There are not many little corner cases. There are two cases: allocation
>> inside a TLAB and allocation outside a TLAB. The former is by far the
>> most common. The latter is generally very infrequent and has a
>> well-defined code path (I described it earlier). And, as I said, it
>> could be very helpful and informative to treat (and account for) the
>> two cases separately.
>>
>>> And my belief is that the arguments against our strategy don't really
>>> hold water, especially given the presence of the per-thread allocation
>>> counter that no one noticed.
>>
>> I’ve already addressed that.
>>
>>> Heck, I've already had it reviewed internally by a Hotspot reviewer
>>> (Chuck Rasbold). All we really need is to write an acceptable JEP, to
>>> adjust the code based on the changes the community wants, and someone
>>> from Oracle willing to say "yes".
>>>
>>> For reference, to keep track of sampling, the delta to C2 is about 150
>>> LOC (much of which is newlines-because-of-formatting for methods that
>>> take a lot of parameters), the delta to C1 is about 60 LOC, the delta
>>> to each x86 template interpreter is about 20 LOC, and the delta for
>>> the assembler is about 40 LOC. It's not completely trivial, but the
>>> code hasn't changed substantially in the 5 years since I wrote it
>>> (other than a couple of bugfixes).
>>>
>>> Obviously, the assembler/template interpreter code would have to be
>>> dup'd across platforms - we can do that for PPC and aarch64, on which
>>> we do active development, at least.
>>
>> I’ll again vote for the simplicity of a change in only one place (OK,
>> two places…).
>>
>>> But, seriously, why didn’t you like my proposal? It can do anything
>>> your scheme can with fewer and simpler code changes. The only thing it
>>> cannot do is sample based on object count (i.e., every 100 objects)
>>> instead of based on object size (i.e., every 1MB of allocations). But
>>> I think doing sampling based on size is the right approach here
>>> (IMHO).
>>>
>>> I agree that sampling based on size is the right approach.
>>>
>>> (And your approach is definitely simpler - I don't mean to discount
>>> it. And if that's what it takes to get this feature accepted, we'll do
>>> it, but I'll grumble about it.)
>>
>> That’s fine. :-)
>>
>> Tony
>>
>> -----
>>
>> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>>
>> @TonyPrintezis
>> tprinte...@twitter.com