Re: JDK-8171119: Low-Overhead Heap Profiling

Erik Österlund Mon, 12 Feb 2018 06:03:39 -0800

Hi JC,

Sorry for the delayed reply.


Inlined answers:

On 2018-02-06 00:04, JC Beyler wrote:

Hi Erik,

(Renaming this to be folded into the newly renamed thread :))

First off, thanks a lot for reviewing the webrev! I appreciate it!

I updated the webrev to:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/

And the incremental one is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/

It contains:
- The change for since from 9 to 11 for the jvmti.xml
- The use of the OrderAccess for initialized
- Clearing the oop

I also have inlined my answers to your comments. The biggest question
will come from the multiple *_end variables. A bit of the logic there
is due to handling the slow path refill vs fast path refill and
checking that the rug was not pulled underneath the slowpath. I
believe that a previous comment was that TlabFastRefill was going to
be deprecated.

If this is true, we could revert this code a bit and just do a : if
TlabFastRefill is enabled, disable this. And then deprecate that when
TlabFastRefill is deprecated.

This might simplify this webrev and I can work on a follow-up that
either: removes TlabFastRefill if Robbin does not have the time to do
it or add the support to the assembly side to handle this correctly.
What do you think?

I support removing TlabFastRefill, but I think it is good to not dependon that happening first.

Now, below, inlined are my answers:

On Fri, Feb 2, 2018 at 8:44 AM, Erik Österlund
<[email protected]> wrote:

Hi JC,

Hope I am reviewing the right version of your work. Here goes...

src/hotspot/share/gc/shared/collectedHeap.inline.hpp:

  159     AllocTracer::send_allocation_outside_tlab(klass, result, size *
HeapWordSize, THREAD);
  160
  161     THREAD->tlab().handle_sample(THREAD, result, size);
  162     return result;
  163   }

Should not call tlab()->X without checking if (UseTLAB) IMO.

Done!


More about this later.

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:

So first of all, there seems to quite a few ends. There is an "end", a "hard
end", a "slow path end", and an "actual end". Moreover, it seems like the
"hard end" is actually further away than the "actual end". So the "hard end"
seems like more of a "really definitely actual end" or something. I don't
know about you, but I think it looks kind of messy. In particular, I don't
feel like the name "actual end" reflects what it represents, especially when
there is another end that is behind the "actual end".

  413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
  414   // Did a fast TLAB refill occur?
  415   if (_slow_path_end != _end) {
  416     // Fix up the actual end to be now the end of this TLAB.
  417     _slow_path_end = _end;
  418     _actual_end = _end;
  419   }
  420
  421   return _actual_end + alignment_reserve();
  422 }

I really do not like making getters unexpectedly have these kind of side
effects. It is not expected that when you ask for the "hard end", you
implicitly update the "slow path end" and "actual end" to new values.

As I said, a lot of this is due to the FastTlabRefill. If I make this
not supporting FastTlabRefill, this goes away. The reason the system
needs to update itself at the get is that you only know at that get if
things have shifted underneath the tlab slow path. I am not sure of
really better names (naming is hard!), perhaps we could do these
names:

- current_tlab_end       // Either the allocated tlab end or a sampling point
- last_allocation_address  // The end of the tlab allocation
- last_slowpath_allocated_end  // In case a fast refill occurred the
end might have changed, this is to remember slow vs fast past refills

the hard_end method can be renamed to something like:
tlab_end_pointer()        // The end of the lab including a bit of
alignment reserved bytes

Those names sound better to me. Could you please provide a mapping fromthe old names to the new names so I understand which one is which please?


This is my current guess of what you are proposing:

end -> current_tlab_end
actual_end -> last_allocation_address
slow_path_end -> last_slowpath_allocated_end
hard_end -> tlab_end_pointer

I would prefer this naming:

end -> slow_path_end // the end for taking a slow path; either due tosampling or refilling

actual_end -> allocation_end // the end for allocations

slow_path_end -> last_slow_path_end // last address for slow_path_end(as opposed to allocation_end)

hard_end -> reserved_end // the end of the reserved space of the TLAB

About setting things in the getter... that still seems like a veryunpleasant thing to me. It would be better to inspect the call hierarchyand explicitly update the ends where they need updating, and assert inthe getter that they are in sync, rather than implicitly setting variousends as a surprising side effect in a getter. It looks like the callhierarchy is very small. With my new naming convention, reserved_end()would presumably return _allocation_end + alignment_reserve(), and havean assert checking that _allocation_end ==_last_slow_path_allocation_end, complaining that this invariant musthold, and that a caller to this function, such as make_parsable(), mustfirst explicitly synchronize the ends as required, to honor that invariant.

Not sure it's better but before updating the webrev, I wanted to try
to get input/consensus :)

(Note hard_end was always further off than end).

src/hotspot/share/prims/jvmti.xml:

10357       <capabilityfield id="can_sample_heap" since="9">
10358         <description>
10359           Can sample the heap.
10360           If this capability is enabled then the heap sampling methods
can be called.
10361         </description>
10362       </capabilityfield>

Looks like this capability should not be "since 9" if it gets integrated
now.

Updated now to 11, crossing my fingers :)

src/hotspot/share/runtime/heapMonitoring.cpp:

  448       if (is_alive->do_object_b(value)) {
  449         // Update the oop to point to the new object if it is still
alive.
  450         f->do_oop(&(trace.obj));
  451
  452         // Copy the old trace, if it is still live.
  453         _allocated_traces->at_put(curr_pos++, trace);
  454
  455         // Store the live trace in a cache, to be served up on /heapz.
  456         _traces_on_last_full_gc->append(trace);
  457
  458         count++;
  459       } else {
  460         // If the old trace is no longer live, add it to the list of
  461         // recently collected garbage.
  462         store_garbage_trace(trace);
  463       }

In the case where the oop was not live, I would like it to be explicitly
cleared.

Done I think how you wanted it. Let me know because I'm not familiar
with the RootAccess API. I'm unclear if I'm doing this right or not so
reviews of these parts are highly appreciated. Robbin had talked of
perhaps later pushing this all into a OopStorage, should I do this now
do you think? Or can that wait a second webrev later down the road?

I think using handles can and should be done later. You can use theAccess API now.I noticed that you are missing an #include "oops/access.inline.hpp" inyour heapMonitoring.cpp file.

+ Did I clear it the way you wanted me to or were you thinking of
something else?


That is precisely how I wanted it to be cleared. Thanks.

+ Final question here, seems like if I were to want to not do the
f->do_oop directly on the trace.obj, I'd need to do something like:

    f->do_oop(&value);
    ...
    trace->store_oop(value);

to update the oop internally. Is that right/is that one of the
advantages of going to the Oopstorage sooner than later?

I think you really want to do the do_oop on the root directly. Is therea particular reason why you would not want to do that?Otherwise, yes - the benefit with using the handle approach is that youdo not need to call do_oop explicitly in your code.

Also I see a lot of concurrent-looking use of the following field:
  267   volatile bool _initialized;

Please note that the "volatile" qualifier does not help with reordering
here. Reordering between volatile and non-volatile fields is completely free
for both compiler and hardware, except for windows with MSVC, where volatile
semantics is defined to use acquire/release semantics, and the hardware is
TSO. But for the general case, I would expect this field to be stored with
OrderAccess::release_store and loaded with OrderAccess::load_acquire.
Otherwise it is not thread safe.

Because everything is behind a mutex, I wasn't really worried about
this. I have a test that has multiple threads trying to hit this
corner case and it passes.

However, to be paranoid, I updated it to using the OrderAccess API
now, thanks! Let me know what you think there too!

If it is indeed always supposed to be read and written under a mutex,then I would strongly prefer to have it accessed as a normalnon-volatile member, and have an assertion that given lock is held or weare in a safepoint, as we do in many other places. Something like this:

assert(HeapMonitorStorage_lock->owned_by_self() ||(SafepointSynchronize::is_at_safepoint() &&Thread::current()->is_VM_thread()), "this should not be accessedconcurrently");

It would be confusing to people reading the code if there are uses ofOrderAccess that are actually always protected under a mutex.

As a kind of meta comment, I wonder if it would make sense to add sampling
for non-TLAB allocations. Seems like if someone is rapidly allocating a
whole bunch of 1 MB objects that never fit in a TLAB, I might still be
interested in seeing that in my traces, and not get surprised that the
allocation rate is very high yet not showing up in any profiles.

That is handled by the handle_sample where you wanted me to put a
UseTlab because you hit that case if the allocation is too big.

I see. It was not obvious to me that non-TLAB sampling is done in theTLAB class. That seems like an abstraction crime.What I wanted in my previous comment was that we do not call into theTLAB when we are not using TLABs. If there is sampling logic in the TLABthat is used for something else than TLABs, then it seems like thatlogic simply does not belong inside of the TLAB. It should be moved outof the TLAB, and instead have the TLAB call this common abstraction thatmakes sense.


Hope I have answered your questions and that my feedback makes sense to you.

Thanks,
/Erik

I double checked by changing the test
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java

to use a smaller Tlab (2048) and made the object bigger and it goes
through that and passes.

Thanks again for your review and I look forward to your pointers for
the questions I now have raised!
Jc

Thanks,
/Erik


On 2018-01-26 06:45, JC Beyler wrote:

Thanks Robbin for the reviews :)

The new full webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/
The incremental webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/

I inlined my answers:

On Thu, Jan 25, 2018 at 1:15 AM, Robbin Ehn <[email protected]> wrote:

Hi JC, great to see another revision!

####
heapMonitoring.cpp

StackTraceData should not contain the oop for 'safety' reasons.
When StackTraceData is moved from _allocated_traces:
L452 store_garbage_trace(trace);
it contains a dead oop.
_allocated_traces could instead be a tupel of oop and StackTraceData thus
dead oops are not kept.

Done I used inheritance to make the copier work regardless but the
idea is the same.

You should use the new Access API for loading the oop, something like
this:
RootAccess<ON_PHANTOM_OOP_REF | AS_NO_KEEPALIVE>::load(...)
I don't think you need to use Access API for clearing the oop, but it
would
look nicer. And you shouldn't probably be using:
Universe::heap()->is_in_reserved(value)

I am unfamiliar with this but I think I did do it like you wanted me
to (all tests pass so that's a start). I'm not sure how to clear the
oop exactly, is there somewhere that does that, which I can use to do
the same?

I removed the is_in_reserved, this came from our internal version, I
don't know why it was there but my tests work without so I removed it
:)

The lock:
L424   MutexLocker mu(HeapMonitorStorage_lock);
Is not needed as far as I can see.
weak_oops_do is called in a safepoint, no TLAB allocation can happen and
JVMTI thread can't access these data-structures. Is there something more
to
this lock that I'm missing?

Since a thread can call the JVMTI getLiveTraces (or any of the other
ones), it can get to the point of trying to copying the
_allocated_traces. I imagine it is possible that this is happening
during a GC or that it can be started and a GC happens afterwards.
Therefore, it seems to me that you want this protected, no?

####
You have 6 files without any changes in them (any more):
g1CollectedHeap.cpp
psMarkSweep.cpp
psParallelCompact.cpp
genCollectedHeap.cpp
referenceProcessor.cpp
thread.hpp

Done.

####
I have not looked closely, but is it possible to hide heap sampling in
AllocTracer ? (with some minor changes to the AllocTracer API)

I am imagining that you are saying to move the code that does the
sampling code (change the tlab end, do the call to HeapMonitoring,
etc.) into the AllocTracer code itself? I think that is right and I'll
look if that is possible and prepare a webrev to show what would be
needed to make that happen.

####
Minor nit, when declaring pointer there is a little mix of having the
pointer adjacent by type name and data name. (Most hotspot code is by
type
name)
E.g.
heapMonitoring.cpp:711     jvmtiStackTrace *trace = ....
heapMonitoring.cpp:733         Method* m = vfst.method();
(not just this file)

Done!

####
HeapMonitorThreadOnOffTest.java:77
I would make g_tmp volatile, otherwise the assignment in loop may
theoretical be skipped.

Also done!

Thanks again!
Jc

Re: JDK-8171119: Low-Overhead Heap Profiling

Reply via email to