Re: JDK-8171119: Low-Overhead Heap Profiling

Erik Österlund Wed, 14 Feb 2018 02:50:42 -0800

Hi JC,

Comments are inlined below.


On 2018-02-13 06:18, JC Beyler wrote:

Hi Erik,

Thanks for your answers, I've now inlined my own answers/comments.

I've done a new webrev here:

http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/<http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.08/>


The incremental is here:

http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/<http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>


Note to all:

- I've been integrating changes from Erin/Serguei/David comments sothis webrev incremental is a bit an answer to all comments in one. Iapologize for that :)

On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund<[email protected] <mailto:[email protected]>> wrote:


    Hi JC,

    Sorry for the delayed reply.

    Inlined answers:


    On 2018-02-06 00:04, JC Beyler wrote:

        Hi Erik,

        (Renaming this to be folded into the newly renamed thread :))

        First off, thanks a lot for reviewing the webrev! I appreciate it!

        I updated the webrev to:
        http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
        <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/>

        And the incremental one is here:
        http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
        <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.04_05a/>

        It contains:
        - The change for since from 9 to 11 for the jvmti.xml
        - The use of the OrderAccess for initialized
        - Clearing the oop

        I also have inlined my answers to your comments. The biggest
        question
        will come from the multiple *_end variables. A bit of the
        logic there
        is due to handling the slow path refill vs fast path refill and
        checking that the rug was not pulled underneath the slowpath. I
        believe that a previous comment was that TlabFastRefill was
        going to
        be deprecated.

        If this is true, we could revert this code a bit and just do a
        : if
        TlabFastRefill is enabled, disable this. And then deprecate
        that when
        TlabFastRefill is deprecated.

        This might simplify this webrev and I can work on a follow-up that
        either: removes TlabFastRefill if Robbin does not have the
        time to do
        it or add the support to the assembly side to handle this
        correctly.
        What do you think?


    I support removing TlabFastRefill, but I think it is good to not
    depend on that happening first.

I'm slowly pushing on the FastTLABRefill(https://bugs.openjdk.java.net/browse/JDK-8194084<https://bugs.openjdk.java.net/browse/JDK-8194084>), I agree onkeeping both separate for now though so that we can think of bothdifferently


        Now, below, inlined are my answers:

        On Fri, Feb 2, 2018 at 8:44 AM, Erik Österlund
        <[email protected] <mailto:[email protected]>>
        wrote:

            Hi JC,

            Hope I am reviewing the right version of your work. Here
            goes...

            src/hotspot/share/gc/shared/collectedHeap.inline.hpp:

              159     AllocTracer::send_allocation_outside_tlab(klass,
            result, size *
            HeapWordSize, THREAD);
              160
              161     THREAD->tlab().handle_sample(THREAD, result, size);
              162     return result;
              163   }

            Should not call tlab()->X without checking if (UseTLAB) IMO.

        Done!


    More about this later.



            src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:

            So first of all, there seems to quite a few ends. There is
            an "end", a "hard
            end", a "slow path end", and an "actual end". Moreover, it
            seems like the
            "hard end" is actually further away than the "actual end".
            So the "hard end"
            seems like more of a "really definitely actual end" or
            something. I don't
            know about you, but I think it looks kind of messy. In
            particular, I don't
            feel like the name "actual end" reflects what it
            represents, especially when
            there is another end that is behind the "actual end".

              413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
              414   // Did a fast TLAB refill occur?
              415   if (_slow_path_end != _end) {
              416     // Fix up the actual end to be now the end of
            this TLAB.
              417     _slow_path_end = _end;
              418     _actual_end = _end;
              419   }
              420
              421   return _actual_end + alignment_reserve();
              422 }

            I really do not like making getters unexpectedly have
            these kind of side
            effects. It is not expected that when you ask for the
            "hard end", you
            implicitly update the "slow path end" and "actual end" to
            new values.

        As I said, a lot of this is due to the FastTlabRefill. If I
        make this
        not supporting FastTlabRefill, this goes away. The reason the
        system
        needs to update itself at the get is that you only know at
        that get if
        things have shifted underneath the tlab slow path. I am not
        sure of
        really better names (naming is hard!), perhaps we could do these
        names:

        - current_tlab_end       // Either the allocated tlab end or a
        sampling point
        - last_allocation_address  // The end of the tlab allocation
        - last_slowpath_allocated_end  // In case a fast refill
        occurred the
        end might have changed, this is to remember slow vs fast past
        refills

        the hard_end method can be renamed to something like:
        tlab_end_pointer()        // The end of the lab including a bit of
        alignment reserved bytes


    Those names sound better to me. Could you please provide a mapping
    from the old names to the new names so I understand which one is
    which please?

    This is my current guess of what you are proposing:

    end -> current_tlab_end
    actual_end -> last_allocation_address
    slow_path_end -> last_slowpath_allocated_end
    hard_end -> tlab_end_pointer


Yes that is correct, that was what I was proposing.

    I would prefer this naming:

    end -> slow_path_end // the end for taking a slow path; either due
    to sampling or refilling
    actual_end -> allocation_end // the end for allocations
    slow_path_end -> last_slow_path_end // last address for
    slow_path_end (as opposed to allocation_end)
    hard_end -> reserved_end // the end of the reserved space of the TLAB

    About setting things in the getter... that still seems like a very
    unpleasant thing to me. It would be better to inspect the call
    hierarchy and explicitly update the ends where they need updating,
    and assert in the getter that they are in sync, rather than
    implicitly setting various ends as a surprising side effect in a
    getter. It looks like the call hierarchy is very small. With my
    new naming convention, reserved_end() would presumably return
    _allocation_end + alignment_reserve(), and have an assert checking
    that _allocation_end == _last_slow_path_allocation_end,
    complaining that this invariant must hold, and that a caller to
    this function, such as make_parsable(), must first explicitly
    synchronize the ends as required, to honor that invariant.

I've renamed the variables to how you preferred it except for the _endone. I did:

current_end
last_allocation_address
tlab_end_ptr

The reason is that the architecture dependent code use the thread.hppAPI and it already has tlab included into the name so it becomestlab_current_end (which is better that tlab_current_tlab_end in myopinion).

I also moved the update into a separate method with a TODO that saysto remove it when FastTLABRefill is deprecated


This looks a lot better now. Thanks.

Note that the following comment now needs updating accordingly inthreadLocalAllocBuffer.hpp:


41 // Heap sampling is performed via the end/actual_end fields.
42 // actual_end contains the real end of the tlab allocation,
43 // whereas end can be set to an arbitrary spot in the tlab to
44 // trip the return and sample the allocation.
45 // slow_path_end is used to track if a fast tlab refill occured
46 // between slowpath calls.

There might be other comments too, I have not looked in detail.



        Not sure it's better but before updating the webrev, I wanted
        to try
        to get input/consensus :)

        (Note hard_end was always further off than end).

            src/hotspot/share/prims/jvmti.xml:

            10357       <capabilityfield id="can_sample_heap" since="9">
            10358         <description>
            10359           Can sample the heap.
            10360           If this capability is enabled then the
            heap sampling methods
            can be called.
            10361         </description>
            10362       </capabilityfield>

            Looks like this capability should not be "since 9" if it
            gets integrated
            now.

        Updated now to 11, crossing my fingers :)


            src/hotspot/share/runtime/heapMonitoring.cpp:

              448       if (is_alive->do_object_b(value)) {
              449         // Update the oop to point to the new object
            if it is still
            alive.
              450         f->do_oop(&(trace.obj));
              451
              452         // Copy the old trace, if it is still live.
              453         _allocated_traces->at_put(curr_pos++, trace);
              454
              455         // Store the live trace in a cache, to be
            served up on /heapz.
              456         _traces_on_last_full_gc->append(trace);
              457
              458         count++;
              459       } else {
              460         // If the old trace is no longer live, add
            it to the list of
              461         // recently collected garbage.
              462         store_garbage_trace(trace);
              463       }

            In the case where the oop was not live, I would like it to
            be explicitly
            cleared.

        Done I think how you wanted it. Let me know because I'm not
        familiar
        with the RootAccess API. I'm unclear if I'm doing this right
        or not so
        reviews of these parts are highly appreciated. Robbin had
        talked of
        perhaps later pushing this all into a OopStorage, should I do
        this now
        do you think? Or can that wait a second webrev later down the
        road?


    I think using handles can and should be done later. You can use
    the Access API now.
    I noticed that you are missing an #include
    "oops/access.inline.hpp" in your heapMonitoring.cpp file.

The missing header is there for me so I don't know, I made sure it ispresent in the latest webrev. Sorry about that.



        + Did I clear it the way you wanted me to or were you thinking of
        something else?


    That is precisely how I wanted it to be cleared. Thanks.

        + Final question here, seems like if I were to want to not do the
        f->do_oop directly on the trace.obj, I'd need to do something
        like:

            f->do_oop(&value);
            ...
            trace->store_oop(value);

        to update the oop internally. Is that right/is that one of the
        advantages of going to the Oopstorage sooner than later?


    I think you really want to do the do_oop on the root directly. Is
    there a particular reason why you would not want to do that?
    Otherwise, yes - the benefit with using the handle approach is
    that you do not need to call do_oop explicitly in your code.

There is no reason except that now we have a load_oop and aget_oop_addr, I was not sure what you would think of that.


That's fine.


            Also I see a lot of concurrent-looking use of the
            following field:
              267   volatile bool _initialized;

            Please note that the "volatile" qualifier does not help
            with reordering
            here. Reordering between volatile and non-volatile fields
            is completely free
            for both compiler and hardware, except for windows with
            MSVC, where volatile
            semantics is defined to use acquire/release semantics, and
            the hardware is
            TSO. But for the general case, I would expect this field
            to be stored with
            OrderAccess::release_store and loaded with
            OrderAccess::load_acquire.
            Otherwise it is not thread safe.

        Because everything is behind a mutex, I wasn't really worried
        about
        this. I have a test that has multiple threads trying to hit this
        corner case and it passes.

        However, to be paranoid, I updated it to using the OrderAccess API
        now, thanks! Let me know what you think there too!


    If it is indeed always supposed to be read and written under a
    mutex, then I would strongly prefer to have it accessed as a
    normal non-volatile member, and have an assertion that given lock
    is held or we are in a safepoint, as we do in many other places.
    Something like this:

    assert(HeapMonitorStorage_lock->owned_by_self() ||
    (SafepointSynchronize::is_at_safepoint() &&
    Thread::current()->is_VM_thread()), "this should not be accessed
    concurrently");

    It would be confusing to people reading the code if there are uses
    of OrderAccess that are actually always protected under a mutex.

Thank you for the exact example to be put in the code! I put it aroundeach access/assignment of the _initialized method and found one casewhere yes you can touch it and not have the lock. It actually is "ok"because you don't act on the storage until later and only when youreally want to modify the storage (see the object_alloc_do_samplemethod which calls the add_trace method).

But, because of this, I'm going to put the OrderAccess here, I'll dosome performance numbers later and if there are issues, I might add a"unsafe" read and a "safe" one to make it explicit to the reader. ButI don't think it will come to that.


Okay. This double return in heapMonitoring.cpp looks wrong:

 283   bool initialized() {
 284     return OrderAccess::load_acquire(&_initialized) != 0;
 285     return _initialized;
 286   }

Since you said object_alloc_do_sample() is the only place where you donot hold the mutex while reading initialized(), I had a closer look atthat. It looks like in its current shape, the lack of a mutex may leadto a memory leak. In particular, it first checks if (initialized()).Let's assume this is now true. It then allocates a bunch of stuff, andchecks if the number of frames were over 0. If they were, it callsStackTraceStorage::storage()->add_trace() seemingly hoping that aftergrabbing the lock in there, initialized() will still return true. But itcould now return false and skip doing anything, in which case theallocated stuff will never be freed.

So the analysis seems to be that _initialized is only used outside ofthe mutex in once instance, where it is used to perform double-checkedlocking, that actually causes a memory leak.

I am not proposing how to fix that, just raising the issue. If you stillwant to perform this double-checked locking somehow, then the use ofacquire/release still seems odd. Because the memory orderingrestrictions of it never comes into play in this particular case. If itever did, then the use of destroy_stuff(); release_store(_initialized,0) would be broken anyway as that would imply that whatever concurrentreader there ever was would after reading _initialized withload_acquire() could *never* read the data that is concurrentlydestroyed anyway. I would be biased to think thatRawAccess<MO_RELAXED>::load/store looks like a more appropriatesolution, given that the memory leak issue is resolved. I do not knowhow painful it would be to not perform this double-checked locking.

            As a kind of meta comment, I wonder if it would make sense
            to add sampling
            for non-TLAB allocations. Seems like if someone is rapidly
            allocating a
            whole bunch of 1 MB objects that never fit in a TLAB, I
            might still be
            interested in seeing that in my traces, and not get
            surprised that the
            allocation rate is very high yet not showing up in any
            profiles.

        That is handled by the handle_sample where you wanted me to put a
        UseTlab because you hit that case if the allocation is too big.


    I see. It was not obvious to me that non-TLAB sampling is done in
    the TLAB class. That seems like an abstraction crime.
    What I wanted in my previous comment was that we do not call into
    the TLAB when we are not using TLABs. If there is sampling logic
    in the TLAB that is used for something else than TLABs, then it
    seems like that logic simply does not belong inside of the TLAB.
    It should be moved out of the TLAB, and instead have the TLAB call
    this common abstraction that makes sense.


So in the incremental version:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/<http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>, thisis still a "crime". The reason is that the system has to have thebytes_until_sample on a per-thread level and it made "sense" to haveit with the TLAB implementation. Also, I was not sure how people feltabout adding something to the thread instance instead.
Do you think it fits better at the Thread level? I can see howdifficult it is to make it happen there and add some logic there. Letme know what you think.

We have an unfortunate situation where everyone that has some fieldsthat are thread local tend to dump them right into Thread, making thesize and complexity of Thread grow as it becomes tightly coupled withvarious unrelated subsystems. It would be desirable to have a separateclass for this instead that encapsulates the sampling logic. That classcould possibly reside in Thread though as a value object of Thread.

    Hope I have answered your questions and that my feedback makes
    sense to you.
You have and thank you for them, I think we are getting to a cleanerimplementation and things are getting better and more readable :)


Yes it is getting better.

Thanks,
/Erik

Thanks for your help!
Jc

    Thanks,
    /Erik


        I double checked by changing the test
        
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java
        
<http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java>

        to use a smaller Tlab (2048) and made the object bigger and it
        goes
        through that and passes.

        Thanks again for your review and I look forward to your
        pointers for
        the questions I now have raised!
        Jc









            Thanks,
            /Erik


            On 2018-01-26 06:45, JC Beyler wrote:

                Thanks Robbin for the reviews :)

                The new full webrev is here:
                http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/ 
<http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.03/>
                The incremental webrev is here:
                http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/
                <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.02_03/>

                I inlined my answers:

                On Thu, Jan 25, 2018 at 1:15 AM, Robbin Ehn
                <[email protected] <mailto:[email protected]>>
                wrote:

                    Hi JC, great to see another revision!

                    ####
                    heapMonitoring.cpp

                    StackTraceData should not contain the oop for
                    'safety' reasons.
                    When StackTraceData is moved from _allocated_traces:
                    L452 store_garbage_trace(trace);
                    it contains a dead oop.
                    _allocated_traces could instead be a tupel of oop
                    and StackTraceData thus
                    dead oops are not kept.

                Done I used inheritance to make the copier work
                regardless but the
                idea is the same.

                    You should use the new Access API for loading the
                    oop, something like
                    this:
                    RootAccess<ON_PHANTOM_OOP_REF |
                    AS_NO_KEEPALIVE>::load(...)
                    I don't think you need to use Access API for
                    clearing the oop, but it
                    would
                    look nicer. And you shouldn't probably be using:
                    Universe::heap()->is_in_reserved(value)

                I am unfamiliar with this but I think I did do it like
                you wanted me
                to (all tests pass so that's a start). I'm not sure
                how to clear the
                oop exactly, is there somewhere that does that, which
                I can use to do
                the same?

                I removed the is_in_reserved, this came from our
                internal version, I
                don't know why it was there but my tests work without
                so I removed it
                :)


                    The lock:
                    L424   MutexLocker mu(HeapMonitorStorage_lock);
                    Is not needed as far as I can see.
                    weak_oops_do is called in a safepoint, no TLAB
                    allocation can happen and
                    JVMTI thread can't access these data-structures.
                    Is there something more
                    to
                    this lock that I'm missing?

                Since a thread can call the JVMTI getLiveTraces (or
                any of the other
                ones), it can get to the point of trying to copying the
                _allocated_traces. I imagine it is possible that this
                is happening
                during a GC or that it can be started and a GC happens
                afterwards.
                Therefore, it seems to me that you want this
                protected, no?


                    ####
                    You have 6 files without any changes in them (any
                    more):
                    g1CollectedHeap.cpp
                    psMarkSweep.cpp
                    psParallelCompact.cpp
                    genCollectedHeap.cpp
                    referenceProcessor.cpp
                    thread.hpp

                Done.

                    ####
                    I have not looked closely, but is it possible to
                    hide heap sampling in
                    AllocTracer ? (with some minor changes to the
                    AllocTracer API)

                I am imagining that you are saying to move the code
                that does the
                sampling code (change the tlab end, do the call to
                HeapMonitoring,
                etc.) into the AllocTracer code itself? I think that
                is right and I'll
                look if that is possible and prepare a webrev to show
                what would be
                needed to make that happen.

                    ####
                    Minor nit, when declaring pointer there is a
                    little mix of having the
                    pointer adjacent by type name and data name. (Most
                    hotspot code is by
                    type
                    name)
                    E.g.
                    heapMonitoring.cpp:711     jvmtiStackTrace *trace
                    = ....
                    heapMonitoring.cpp:733         Method* m =
                    vfst.method();
                    (not just this file)

                Done!

                    ####
                    HeapMonitorThreadOnOffTest.java:77
                    I would make g_tmp volatile, otherwise the
                    assignment in loop may
                    theoretical be skipped.

                Also done!

                Thanks again!
                Jc

Re: JDK-8171119: Low-Overhead Heap Profiling

Reply via email to