Re: JDK-8171119: Low-Overhead Heap Profiling

Robbin Ehn Wed, 14 Feb 2018 00:13:58 -0800

Hi JC,

Adding Markus Grönlund.


On 2018-02-14 01:11, JC Beyler wrote:

Hi all,

Just to show a bit how to solve the one issue Erik was referring to, consider 
the following webrev:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08a/

and incremental is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08_08a/


We have long standing issue of adding stuff directly to the thread.
Please put tracing stuff into a separate data structure and file.
Hoping that this new structure can contain this and thread specific parts of up 
coming tracing framework. (Markus?)

Maybe follow threadSMR, which is contained in:
open/src/hotspot/share/runtime/threadSMR.cpp
open/src/hotspot/share/runtime/threadSMR.inline.hpp
open/src/hotspot/share/runtime/threadSMR.hpp

And create threadTracing.XXX or similar ?

Thanks, Robbin


This puts the sampling bytes left in the Thread class () and then the code goes 
to Thread to sample or not. The advantage of this is that it is probably 
simpler to understand and follow what is going on, there is less of internal 
tlab magic going on and the outside of tlab allocations go through the thread 
instance the same way the TLAB allocations do.

I think it's cleaner but what do you think?

Thanks!
Jc

On Mon, Feb 12, 2018 at 9:18 PM, JC Beyler <[email protected] 
<mailto:[email protected]>> wrote:

    Hi Erik,

    Thanks for your answers, I've now inlined my own answers/comments.

    I've done a new webrev here:
    http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/ 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/>

    The incremental is here:
    http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/ 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/>

    Note to all:
       - I've been integrating changes from Erin/Serguei/David comments so this 
webrev incremental is a bit an answer to all comments in one. I apologize for 
that :)


    On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund <[email protected] 
<mailto:[email protected]>> wrote:

        Hi JC,

        Sorry for the delayed reply.

        Inlined answers:


        On 2018-02-06 00:04, JC Beyler wrote:

            Hi Erik,

            (Renaming this to be folded into the newly renamed thread :))

            First off, thanks a lot for reviewing the webrev! I appreciate it!

            I updated the webrev to:
            http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/ 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/>

            And the incremental one is here:
            http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/ 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/>

            It contains:
            - The change for since from 9 to 11 for the jvmti.xml
            - The use of the OrderAccess for initialized
            - Clearing the oop

            I also have inlined my answers to your comments. The biggest 
question
            will come from the multiple *_end variables. A bit of the logic 
there
            is due to handling the slow path refill vs fast path refill and
            checking that the rug was not pulled underneath the slowpath. I
            believe that a previous comment was that TlabFastRefill was going to
            be deprecated.

            If this is true, we could revert this code a bit and just do a : if
            TlabFastRefill is enabled, disable this. And then deprecate that 
when
            TlabFastRefill is deprecated.

            This might simplify this webrev and I can work on a follow-up that
            either: removes TlabFastRefill if Robbin does not have the time to 
do
            it or add the support to the assembly side to handle this correctly.
            What do you think?


        I support removing TlabFastRefill, but I think it is good to not depend 
on that happening first.



    I'm slowly pushing on the FastTLABRefill 
(https://bugs.openjdk.java.net/browse/JDK-8194084 
<https://bugs.openjdk.java.net/browse/JDK-8194084>), I agree on keeping both 
separate for now though so that we can think of both differently

            Now, below, inlined are my answers:

            On Fri, Feb 2, 2018 at 8:44 AM, Erik Österlund
            <[email protected] <mailto:[email protected]>> 
wrote:

                Hi JC,

                Hope I am reviewing the right version of your work. Here goes...

                src/hotspot/share/gc/shared/collectedHeap.inline.hpp:

                   159     AllocTracer::send_allocation_outside_tlab(klass, 
result, size *
                HeapWordSize, THREAD);
                   160
                   161     THREAD->tlab().handle_sample(THREAD, result, size);
                   162     return result;
                   163   }

                Should not call tlab()->X without checking if (UseTLAB) IMO.

            Done!


        More about this later.



                src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:

                So first of all, there seems to quite a few ends. There is an "end", 
a "hard
                end", a "slow path end", and an "actual end". Moreover, it 
seems like the
                "hard end" is actually further away than the "actual end". So the 
"hard end"
                seems like more of a "really definitely actual end" or 
something. I don't
                know about you, but I think it looks kind of messy. In 
particular, I don't
                feel like the name "actual end" reflects what it represents, 
especially when
                there is another end that is behind the "actual end".

                   413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
                   414   // Did a fast TLAB refill occur?
                   415   if (_slow_path_end != _end) {
                   416     // Fix up the actual end to be now the end of this 
TLAB.
                   417     _slow_path_end = _end;
                   418     _actual_end = _end;
                   419   }
                   420
                   421   return _actual_end + alignment_reserve();
                   422 }

                I really do not like making getters unexpectedly have these 
kind of side
                effects. It is not expected that when you ask for the "hard 
end", you
                implicitly update the "slow path end" and "actual end" to new 
values.

            As I said, a lot of this is due to the FastTlabRefill. If I make 
this
            not supporting FastTlabRefill, this goes away. The reason the system
            needs to update itself at the get is that you only know at that get 
if
            things have shifted underneath the tlab slow path. I am not sure of
            really better names (naming is hard!), perhaps we could do these
            names:

            - current_tlab_end       // Either the allocated tlab end or a 
sampling point
            - last_allocation_address  // The end of the tlab allocation
            - last_slowpath_allocated_end  // In case a fast refill occurred the
            end might have changed, this is to remember slow vs fast past 
refills

            the hard_end method can be renamed to something like:
            tlab_end_pointer()        // The end of the lab including a bit of
            alignment reserved bytes


        Those names sound better to me. Could you please provide a mapping from 
the old names to the new names so I understand which one is which please?

        This is my current guess of what you are proposing:

        end -> current_tlab_end
        actual_end -> last_allocation_address
        slow_path_end -> last_slowpath_allocated_end
        hard_end -> tlab_end_pointer


    Yes that is correct, that was what I was proposing.

        I would prefer this naming:

        end -> slow_path_end // the end for taking a slow path; either due to 
sampling or refilling
        actual_end -> allocation_end // the end for allocations
        slow_path_end -> last_slow_path_end // last address for slow_path_end 
(as opposed to allocation_end)
        hard_end -> reserved_end // the end of the reserved space of the TLAB

        About setting things in the getter... that still seems like a very 
unpleasant thing to me. It would be better to inspect the call hierarchy and 
explicitly update the ends where they need updating, and assert in the getter 
that they are in sync, rather than implicitly setting various ends as a 
surprising side effect in a getter. It looks like the call hierarchy is very 
small. With my new naming convention, reserved_end() would presumably return 
_allocation_end + alignment_reserve(), and have an assert checking that 
_allocation_end == _last_slow_path_allocation_end, complaining that this
        invariant must hold, and that a caller to this function, such as 
make_parsable(), must first explicitly synchronize the ends as required, to 
honor that invariant.



    I've renamed the variables to how you preferred it except for the _end one. 
I did:
    current_end
    last_allocation_address
    tlab_end_ptr

    The reason is that the architecture dependent code use the thread.hpp API 
and it already has tlab included into the name so it becomes tlab_current_end 
(which is better that tlab_current_tlab_end in my opinion).

    I also moved the update into a separate method with a TODO that says to 
remove it when FastTLABRefill is deprecated


            Not sure it's better but before updating the webrev, I wanted to try
            to get input/consensus :)

            (Note hard_end was always further off than end).

                src/hotspot/share/prims/jvmti.xml:

                10357       <capabilityfield id="can_sample_heap" since="9">
                10358         <description>
                10359           Can sample the heap.
                10360           If this capability is enabled then the heap 
sampling methods
                can be called.
                10361         </description>
                10362       </capabilityfield>

                Looks like this capability should not be "since 9" if it gets 
integrated
                now.

            Updated now to 11, crossing my fingers :)


                src/hotspot/share/runtime/heapMonitoring.cpp:

                   448       if (is_alive->do_object_b(value)) {
                   449         // Update the oop to point to the new object if 
it is still
                alive.
                   450         f->do_oop(&(trace.obj));
                   451
                   452         // Copy the old trace, if it is still live.
                   453         _allocated_traces->at_put(curr_pos++, trace);
                   454
                   455         // Store the live trace in a cache, to be served 
up on /heapz.
                   456         _traces_on_last_full_gc->append(trace);
                   457
                   458         count++;
                   459       } else {
                   460         // If the old trace is no longer live, add it to 
the list of
                   461         // recently collected garbage.
                   462         store_garbage_trace(trace);
                   463       }

                In the case where the oop was not live, I would like it to be 
explicitly
                cleared.

            Done I think how you wanted it. Let me know because I'm not familiar
            with the RootAccess API. I'm unclear if I'm doing this right or not 
so
            reviews of these parts are highly appreciated. Robbin had talked of
            perhaps later pushing this all into a OopStorage, should I do this 
now
            do you think? Or can that wait a second webrev later down the road?


        I think using handles can and should be done later. You can use the 
Access API now.
        I noticed that you are missing an #include "oops/access.inline.hpp" in 
your heapMonitoring.cpp file.


    The missing header is there for me so I don't know, I made sure it is 
present in the latest webrev. Sorry about that.


            + Did I clear it the way you wanted me to or were you thinking of
            something else?


        That is precisely how I wanted it to be cleared. Thanks.

            + Final question here, seems like if I were to want to not do the
            f->do_oop directly on the trace.obj, I'd need to do something like:

                 f->do_oop(&value);
                 ...
                 trace->store_oop(value);

            to update the oop internally. Is that right/is that one of the
            advantages of going to the Oopstorage sooner than later?


        I think you really want to do the do_oop on the root directly. Is there 
a particular reason why you would not want to do that?
        Otherwise, yes - the benefit with using the handle approach is that you 
do not need to call do_oop explicitly in your code.


    There is no reason except that now we have a load_oop and a get_oop_addr, I 
was not sure what you would think of that.



                Also I see a lot of concurrent-looking use of the following 
field:
                   267   volatile bool _initialized;

                Please note that the "volatile" qualifier does not help with 
reordering
                here. Reordering between volatile and non-volatile fields is 
completely free
                for both compiler and hardware, except for windows with MSVC, 
where volatile
                semantics is defined to use acquire/release semantics, and the 
hardware is
                TSO. But for the general case, I would expect this field to be 
stored with
                OrderAccess::release_store and loaded with 
OrderAccess::load_acquire.
                Otherwise it is not thread safe.

            Because everything is behind a mutex, I wasn't really worried about
            this. I have a test that has multiple threads trying to hit this
            corner case and it passes.

            However, to be paranoid, I updated it to using the OrderAccess API
            now, thanks! Let me know what you think there too!


        If it is indeed always supposed to be read and written under a mutex, 
then I would strongly prefer to have it accessed as a normal non-volatile 
member, and have an assertion that given lock is held or we are in a safepoint, 
as we do in many other places. Something like this:

        assert(HeapMonitorStorage_lock->owned_by_self() || (SafepointSynchronize::is_at_safepoint() 
&& Thread::current()->is_VM_thread()), "this should not be accessed concurrently");

        It would be confusing to people reading the code if there are uses of 
OrderAccess that are actually always protected under a mutex.


    Thank you for the exact example to be put in the code! I put it around each 
access/assignment of the _initialized method and found one case where yes you can touch 
it and not have the lock. It actually is "ok" because you don't act on the 
storage until later and only when you really want to modify the storage (see the 
object_alloc_do_sample method which calls the add_trace method).

    But, because of this, I'm going to put the OrderAccess here, I'll do some performance numbers 
later and if there are issues, I might add a "unsafe" read and a "safe" one to 
make it explicit to the reader. But I don't think it will come to that.

                As a kind of meta comment, I wonder if it would make sense to 
add sampling
                for non-TLAB allocations. Seems like if someone is rapidly 
allocating a
                whole bunch of 1 MB objects that never fit in a TLAB, I might 
still be
                interested in seeing that in my traces, and not get surprised 
that the
                allocation rate is very high yet not showing up in any profiles.

            That is handled by the handle_sample where you wanted me to put a
            UseTlab because you hit that case if the allocation is too big.


        I see. It was not obvious to me that non-TLAB sampling is done in the 
TLAB class. That seems like an abstraction crime.
        What I wanted in my previous comment was that we do not call into the 
TLAB when we are not using TLABs. If there is sampling logic in the TLAB that 
is used for something else than TLABs, then it seems like that logic simply 
does not belong inside of the TLAB. It should be moved out of the TLAB, and 
instead have the TLAB call this common abstraction that makes sense.


    So in the incremental version:
    http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/ 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/>, this is still a "crime". 
The reason is that the system has to have the bytes_until_sample on a per-thread level and it made 
"sense" to have it with the TLAB implementation. Also, I was not sure how people felt about 
adding something to the thread instance instead.

    Do you think it fits better at the Thread level? I can see how difficult it 
is to make it happen there and add some logic there. Let me know what you think.

        Hope I have answered your questions and that my feedback makes sense to 
you.


    You have and thank you for them, I think we are getting to a cleaner 
implementation and things are getting better and more readable :)

    Thanks for your help!
    Jc

        Thanks,
        /Erik


            I double checked by changing the test
            
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java
 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java>

            to use a smaller Tlab (2048) and made the object bigger and it goes
            through that and passes.

            Thanks again for your review and I look forward to your pointers for
            the questions I now have raised!
            Jc









                Thanks,
                /Erik


                On 2018-01-26 06:45, JC Beyler wrote:

                    Thanks Robbin for the reviews :)

                    The new full webrev is here:
                    http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/ 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/>
                    The incremental webrev is here:
                    http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/ 
<http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/>

                    I inlined my answers:

                    On Thu, Jan 25, 2018 at 1:15 AM, Robbin Ehn <[email protected] 
<mailto:[email protected]>> wrote:

                        Hi JC, great to see another revision!

                        ####
                        heapMonitoring.cpp

                        StackTraceData should not contain the oop for 'safety' 
reasons.
                        When StackTraceData is moved from _allocated_traces:
                        L452 store_garbage_trace(trace);
                        it contains a dead oop.
                        _allocated_traces could instead be a tupel of oop and 
StackTraceData thus
                        dead oops are not kept.

                    Done I used inheritance to make the copier work regardless 
but the
                    idea is the same.

                        You should use the new Access API for loading the oop, 
something like
                        this:
                        RootAccess<ON_PHANTOM_OOP_REF | 
AS_NO_KEEPALIVE>::load(...)
                        I don't think you need to use Access API for clearing 
the oop, but it
                        would
                        look nicer. And you shouldn't probably be using:
                        Universe::heap()->is_in_reserved(value)

                    I am unfamiliar with this but I think I did do it like you 
wanted me
                    to (all tests pass so that's a start). I'm not sure how to 
clear the
                    oop exactly, is there somewhere that does that, which I can 
use to do
                    the same?

                    I removed the is_in_reserved, this came from our internal 
version, I
                    don't know why it was there but my tests work without so I 
removed it
                    :)


                        The lock:
                        L424   MutexLocker mu(HeapMonitorStorage_lock);
                        Is not needed as far as I can see.
                        weak_oops_do is called in a safepoint, no TLAB 
allocation can happen and
                        JVMTI thread can't access these data-structures. Is 
there something more
                        to
                        this lock that I'm missing?

                    Since a thread can call the JVMTI getLiveTraces (or any of 
the other
                    ones), it can get to the point of trying to copying the
                    _allocated_traces. I imagine it is possible that this is 
happening
                    during a GC or that it can be started and a GC happens 
afterwards.
                    Therefore, it seems to me that you want this protected, no?


                        ####
                        You have 6 files without any changes in them (any more):
                        g1CollectedHeap.cpp
                        psMarkSweep.cpp
                        psParallelCompact.cpp
                        genCollectedHeap.cpp
                        referenceProcessor.cpp
                        thread.hpp

                    Done.

                        ####
                        I have not looked closely, but is it possible to hide 
heap sampling in
                        AllocTracer ? (with some minor changes to the 
AllocTracer API)

                    I am imagining that you are saying to move the code that 
does the
                    sampling code (change the tlab end, do the call to 
HeapMonitoring,
                    etc.) into the AllocTracer code itself? I think that is 
right and I'll
                    look if that is possible and prepare a webrev to show what 
would be
                    needed to make that happen.

                        ####
                        Minor nit, when declaring pointer there is a little mix 
of having the
                        pointer adjacent by type name and data name. (Most 
hotspot code is by
                        type
                        name)
                        E.g.
                        heapMonitoring.cpp:711     jvmtiStackTrace *trace = ....
                        heapMonitoring.cpp:733         Method* m = 
vfst.method();
                        (not just this file)

                    Done!

                        ####
                        HeapMonitorThreadOnOffTest.java:77
                        I would make g_tmp volatile, otherwise the assignment 
in loop may
                        theoretical be skipped.

                    Also done!

                    Thanks again!
                    Jc

Re: JDK-8171119: Low-Overhead Heap Profiling

Reply via email to