Hi JC,

  I have just checked on arm32: your patch compiles and runs ok.

As far as I can see, the jtreg agentlib name "-agentlib:HeapMonitor" does not correspond to the actual library name: libHeapMonitorTest.c -> libHeapMonitorTest.so

Boris

On 04.04.2018 01:54, White, Derek wrote:
Thanks JC,

New patch applies cleanly. Compiles and runs (simple test programs) on aarch64.

  * Derek

*From:* JC Beyler [mailto:jcbey...@google.com]
*Sent:* Monday, April 02, 2018 1:17 PM
*To:* White, Derek <derek.wh...@cavium.com>
*Cc:* Erik Österlund <erik.osterl...@oracle.com>; serviceability-dev@openjdk.java.net; hotspot-compiler-dev <hotspot-compiler-...@openjdk.java.net>
*Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling

Hi Derek,

I know there were a few things that went in that provoked a merge conflict. I worked on it and got the patch up to date. Sadly, my lack of familiarity with the tooling means it ended up as a full rebase instead of keeping all the history. However, with a newly cloned jdk/hs you should now be able to use:

http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/

The change you are referring to was done along with the others, so perhaps you were unlucky and I left it out of one webrev and fixed it in another? In any case, I checked and it is here:

http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html

I double checked that tlab_end_offset no longer appears in any architecture (as far as I can tell :)).

Thanks for testing and let me know if you run into any other issues!

Jc

On Fri, Mar 30, 2018 at 4:24 PM White, Derek <derek.wh...@cavium.com <mailto:derek.wh...@cavium.com>> wrote:

    Hi Jc,

    I’ve been having trouble getting your patch to apply correctly. I
    may have based it on the wrong version.

    In any case, I think there’s a missing update to
    macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
    where “JavaThread::tlab_end_offset()” should become
    “JavaThread::tlab_current_end_offset()”.

    This should correspond to the other ports’ changes in
    templateTable_<cpu>.cpp files.
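
    From memory, the change would essentially be the line below in
    MacroAssembler::tlab_allocate() (a sketch only; the destination
    register is illustrative, not copied from the file):

        // before:
        ldr(t1, Address(rthread, JavaThread::tlab_end_offset()));
        // after, matching the renamed accessor:
        ldr(t1, Address(rthread, JavaThread::tlab_current_end_offset()));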

    Thanks!
    - Derek

    *From:* hotspot-compiler-dev
    [mailto:hotspot-compiler-dev-boun...@openjdk.java.net
    <mailto:hotspot-compiler-dev-boun...@openjdk.java.net>] *On Behalf
    Of *JC Beyler
    *Sent:* Wednesday, March 28, 2018 11:43 AM
    *To:* Erik Österlund <erik.osterl...@oracle.com
    <mailto:erik.osterl...@oracle.com>>
    *Cc:* serviceability-dev@openjdk.java.net
    <mailto:serviceability-dev@openjdk.java.net>; hotspot-compiler-dev
    <hotspot-compiler-...@openjdk.java.net
    <mailto:hotspot-compiler-...@openjdk.java.net>>
    *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling

    Hi all,

    I've been working on deflaking the tests mostly and the wording in
    the JVMTI spec.

    Here are the two incremental webrevs:

    http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/

    http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/

    Here is the total webrev:

    http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/

    Here are the notes of this change:

       - Currently the tests pass 100 times in a row; I am working on
    checking whether they pass 1000 times in a row.

       - The default sampling rate is set to 512k, which is what we use
    internally. Having a default means that, to enable sampling at the
    default rate, the user only has to enable/disable the event via JVMTI
    (instead of enable + set sample rate); see the sketch after these notes.

       - I deprecated the code that was handling the fast-path tlab
    refill, since FastTLABRefill itself is now deprecated

           - Though I saw that Graal is still using it so I have to see
    what needs to be done there exactly
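
    To make the "enable with defaults" flow concrete, here is a minimal
    agent sketch (the capability and event names are my reading of this
    webrev and may still change, so treat them as illustrative):

        #include <string.h>
        #include <jvmti.h>

        // Enable sampled allocation events at the default 512k rate.
        static jvmtiError enable_heap_sampling(jvmtiEnv* jvmti) {
          jvmtiCapabilities caps;
          memset(&caps, 0, sizeof(caps));
          caps.can_sample_heap = 1;   // capability added by this change
          jvmtiError err = jvmti->AddCapabilities(&caps);
          if (err != JVMTI_ERROR_NONE) {
            return err;
          }
          // No SetHeapSampling call needed: the 512k default applies.
          return jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                                 JVMTI_EVENT_SAMPLED_OBJECT_ALLOC,
                                                 NULL);
        }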

    Finally, using the Dacapo benchmark suite, I noted a 1% overhead
    when the event system is turned on and the callback to the native
    agent is just empty. I got a 3% overhead with a 512k sampling rate
    with the code I put in the native side of my tests.

    Thanks and comments are appreciated,

    Jc

    On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbey...@google.com
    <mailto:jcbey...@google.com>> wrote:

        Hi all,

        The incremental webrev update is here:

        http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/

        The full webrev is here:

        http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/

        Major change here is:

           - I've removed the heapMonitoring.cpp code in favor of just
        having the sampling events as per Serguei's request; I still
        have to do some overhead measurements but the tests prove the
        concept can work

                - Most of the tlab code is unchanged; the only major
        part is that now things get sent off to event collectors when
        used and enabled.

           - Added the interpreter collectors to handle interpreter
        execution

           - Updated the name from SetTlabHeapSampling to
        SetHeapSampling to be more generic

           - Added a mutex for the thread sampling so that we can
        initialize an internal static array safely

           - Ported the tests from the old system to this new one

        I've also updated the JEP and CSR to reflect these changes:

        https://bugs.openjdk.java.net/browse/JDK-8194905

        https://bugs.openjdk.java.net/browse/JDK-8171119

        In order to make this have some forward progress, I've removed
        the heap sampling code entirely and now rely entirely on the
        event sampling system. The tests reflect this by using a
        simplified implementation of what an agent could do:

        
        http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c

        (Search for anything mentioning event_storage).

        I have not taken the time to port the whole code we had
        originally in heapMonitoring to this. I hesitate only because
        that code was in C++; I'd have to port it to C, and this is for
        tests, so perhaps what I have now is good enough?

        As far as testing goes, I've ported all the relevant tests and
        then added a few:

            - Turning the system on/off

            - Testing using various GCs

            - Testing using the interpreter

            - Testing the sampling rate

            - Testing with objects and arrays

            - Testing with various threads

        Finally, as far as overhead goes, I have the numbers for the
        system turned off vs a clean build and I see 0% overhead, which
        is what we'd want. This was using the Dacapo benchmarks. I am
        now preparing to run a version with the events on using Dacapo
        and will report back here.

        Any comments are welcome :)

        Jc

        On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbey...@google.com
        <mailto:jcbey...@google.com>> wrote:

            Hi all,

            I apologize for the delay, but I wanted to add an event
            system, which took a bit longer than expected, and I also
            reworked the code to take into account the deprecation of
            FastTLABRefill.

            This update has four parts:

            A) I moved the implementation from Thread into a
            ThreadHeapSampler member inside of Thread. Would you prefer
            it as a pointer inside of Thread, or does it work for you
            like this? A second question: would you rather have an
            association outside of Thread altogether that tries to
            remember which threads are live, so that we would have
            something like:

            ThreadHeapSampler::get_sampling_size(this_thread);

            I worry about the overhead of this, but perhaps it is not
            too bad?
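
            To make the question concrete, here is a rough sketch of the
            two options (the field and accessor names are illustrative,
            not exactly what is in the webrev):

                // Option A (current webrev): a value object embedded in Thread.
                class ThreadHeapSampler {
                  size_t _bytes_until_sample;
                 public:
                  size_t bytes_until_sample() const     { return _bytes_until_sample; }
                  void set_bytes_until_sample(size_t v) { _bytes_until_sample = v; }
                };

                class Thread {
                  ThreadHeapSampler _heap_sampler;   // or a ThreadHeapSampler*
                 public:
                  ThreadHeapSampler& heap_sampler()  { return _heap_sampler; }
                };

                // Option B: keep Thread untouched and look the sampler up
                // externally, e.g.:
                //   ThreadHeapSampler::get_sampling_size(this_thread);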

            B) I also have been working on the Allocation event system
            that sends out a notification at each sampled event. This
            will be practical when wanting to do something at the
            allocation point. I'm also looking at whether the whole
            heapMonitoring code could reside in the agent code rather
            than in the JDK. I'm not convinced, but I'm talking to Serguei
            about it to see/assess :)

                - Also added two tests for the new event subsystem

            C) Removed the slow_path fields inside the TLAB code since
            now FastTLABRefill is deprecated

            D) Updated the JVMTI documentation and specification for the
            methods.

            So the incremental webrev is here:

            http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/

            and the full webrev is here:

            http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10

            I believe I have updated the various JIRA issues that track
            this :)

            Thanks for your input,

            Jc

            On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler
            <jcbey...@google.com <mailto:jcbey...@google.com>> wrote:

                Hi Erik,

                I inlined my answers, which the last one seems to answer
                Robbin's concerns about the same thing (adding things to
                Thread).

                On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund
                <erik.osterl...@oracle.com
                <mailto:erik.osterl...@oracle.com>> wrote:

                    Hi JC,

                    Comments are inlined below.

                    On 2018-02-13 06:18, JC Beyler wrote:

                        Hi Erik,

                        Thanks for your answers, I've now inlined my own
                        answers/comments.

                        I've done a new webrev here:

                        http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
                        

                        The incremental is here:

                        
                        http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/

                        Note to all:

                           - I've been integrating changes from
                        Erik/Serguei/David's comments, so this incremental
                        webrev is a bit of an answer to all comments
                        in one. I apologize for that :)

                        On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund
                        <erik.osterl...@oracle.com
                        <mailto:erik.osterl...@oracle.com>> wrote:

                            Hi JC,

                            Sorry for the delayed reply.

                            Inlined answers:



                            On 2018-02-06 00:04, JC Beyler wrote:

                                Hi Erik,

                                (Renaming this to be folded into the
                                newly renamed thread :))

                                First off, thanks a lot for reviewing
                                the webrev! I appreciate it!

                                I updated the webrev to:
                                
                                http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/

                                And the incremental one is here:
                                
                                http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/

                                It contains:
                                - The change of the "since" value from 9 to
                                11 in the jvmti.xml
                                - The use of the OrderAccess for initialized
                                - Clearing the oop

                                I also have inlined my answers to your
                                comments. The biggest question
                                will come from the multiple *_end
                                variables. A bit of the logic there
                                is due to handling the slow path refill
                                vs fast path refill and
                                checking that the rug was not pulled
                                underneath the slowpath. I
                                believe that a previous comment was that
                                TlabFastRefill was going to
                                be deprecated.

                                If this is true, we could revert this
                                code a bit and just do: if
                                TlabFastRefill is enabled, disable this.
                                And then deprecate that when
                                TlabFastRefill is deprecated.

                                This might simplify this webrev and I
                                can work on a follow-up that
                                either: removes TlabFastRefill if Robbin
                                does not have the time to do
                                it, or adds the support to the assembly
                                side to handle this correctly.
                                What do you think?

                            I support removing TlabFastRefill, but I
                            think it is good to not depend on that
                            happening first.


                        I'm slowly pushing on the FastTLABRefill
                        (https://bugs.openjdk.java.net/browse/JDK-8194084),
                        but I agree on keeping both separate for now
                        so that we can think of both differently.

                                Now, below, inlined are my answers:

                                On Fri, Feb 2, 2018 at 8:44 AM, Erik
                                Österlund
                                <erik.osterl...@oracle.com
                                <mailto:erik.osterl...@oracle.com>> wrote:

                                    Hi JC,

                                    Hope I am reviewing the right
                                    version of your work. Here goes...

                                    
                                     src/hotspot/share/gc/shared/collectedHeap.inline.hpp:

                                       159   AllocTracer::send_allocation_outside_tlab(klass, result, size * HeapWordSize, THREAD);
                                       160
                                       161   THREAD->tlab().handle_sample(THREAD, result, size);
                                       162     return result;
                                       163   }

                                    Should not call tlab()->X without
                                    checking if (UseTLAB) IMO.

                                Done!


                            More about this later.

                                    
                                     src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:

                                     So first of all, there seem to be
                                     quite a few ends. There is an "end",
                                    a "hard
                                    end", a "slow path end", and an
                                    "actual end". Moreover, it seems
                                    like the
                                    "hard end" is actually further away
                                    than the "actual end". So the "hard end"
                                    seems like more of a "really
                                    definitely actual end" or something.
                                    I don't
                                    know about you, but I think it looks
                                    kind of messy. In particular, I don't
                                    feel like the name "actual end"
                                    reflects what it represents,
                                    especially when
                                    there is another end that is behind
                                    the "actual end".

                                       413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
                                       414   // Did a fast TLAB refill occur?
                                       415   if (_slow_path_end != _end) {
                                       416     // Fix up the actual end to be now the end of this TLAB.
                                       417     _slow_path_end = _end;
                                       418     _actual_end = _end;
                                       419   }
                                       420
                                       421   return _actual_end + alignment_reserve();
                                       422 }

                                    I really do not like making getters
                                     unexpectedly have this kind of side
                                    effects. It is not expected that
                                    when you ask for the "hard end", you
                                    implicitly update the "slow path
                                    end" and "actual end" to new values.

                                As I said, a lot of this is due to the
                                 FastTlabRefill. If I make this
                                 not support FastTlabRefill, this goes
                                away. The reason the system
                                needs to update itself at the get is
                                that you only know at that get if
                                things have shifted underneath the tlab
                                slow path. I am not sure of
                                really better names (naming is hard!),
                                perhaps we could do these
                                names:

                                - current_tlab_end       // Either the
                                allocated tlab end or a sampling point
                                - last_allocation_address  // The end of
                                the tlab allocation
                                - last_slowpath_allocated_end  // In
                                case a fast refill occurred the
                                end might have changed, this is to
                                remember slow vs fast past refills

                                the hard_end method can be renamed to
                                something like:
                                tlab_end_pointer()        // The end of
                                the lab including a bit of
                                alignment reserved bytes

                            Those names sound better to me. Could you
                            please provide a mapping from the old names
                            to the new names so I understand which one
                            is which?

                            This is my current guess of what you are
                            proposing:

                            end -> current_tlab_end
                            actual_end -> last_allocation_address
                            slow_path_end -> last_slowpath_allocated_end
                            hard_end -> tlab_end_pointer

                        Yes that is correct, that was what I was proposing.

                            I would prefer this naming:

                            end -> slow_path_end // the end for taking a
                            slow path; either due to sampling or refilling
                            actual_end -> allocation_end // the end for
                            allocations
                            slow_path_end -> last_slow_path_end // last
                            address for slow_path_end (as opposed to
                            allocation_end)
                            hard_end -> reserved_end // the end of the
                            reserved space of the TLAB

                            About setting things in the getter... that
                            still seems like a very unpleasant thing to
                            me. It would be better to inspect the call
                            hierarchy and explicitly update the ends
                            where they need updating, and assert in the
                            getter that they are in sync, rather than
                            implicitly setting various ends as a
                            surprising side effect in a getter. It looks
                            like the call hierarchy is very small. With
                            my new naming convention, reserved_end()
                            would presumably return _allocation_end +
                            alignment_reserve(), and have an assert
                            checking that _allocation_end ==
                            _last_slow_path_allocation_end, complaining
                            that this invariant must hold, and that a
                            caller to this function, such as
                            make_parsable(), must first explicitly
                            synchronize the ends as required, to honor
                            that invariant.
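
                            In other words, something like this (using my
                            proposed names; just a sketch):

                              HeapWord* ThreadLocalAllocBuffer::reserved_end() {
                                assert(_allocation_end == _last_slow_path_end,
                                       "callers such as make_parsable() must synchronize the ends first");
                                return _allocation_end + alignment_reserve();
                              }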


                        I've renamed the variables to how you preferred
                        it except for the _end one. I did:

                        current_end

                        last_allocation_address

                        tlab_end_ptr

                        The reason is that the architecture-dependent
                        code uses the thread.hpp API and it already has
                        tlab included in the name, so it becomes
                        tlab_current_end (which is better than
                        tlab_current_tlab_end in my opinion).

                        I also moved the update into a separate method
                        with a TODO that says to remove it when
                        FastTLABRefill is deprecated.

                    This looks a lot better now. Thanks.

                    Note that the following comment now needs updating
                    accordingly in threadLocalAllocBuffer.hpp:

                       41 //            Heap sampling is performed via the end/actual_end fields.
                       42 //            actual_end contains the real end of the tlab allocation,
                       43 //            whereas end can be set to an arbitrary spot in the tlab to
                       44 //            trip the return and sample the allocation.
                       45 //            slow_path_end is used to track if a fast tlab refill occured
                       46 //            between slowpath calls.

                    There might be other comments too, I have not looked
                    in detail.

                This was the only spot that still had an actual_end; I
                fixed it now. I'll do a sweep to double check other
                comments.



                                Not sure it's better but before updating
                                the webrev, I wanted to try
                                to get input/consensus :)

                                (Note hard_end was always further off
                                than end).

                                    src/hotspot/share/prims/jvmti.xml:

                                     10357       <capabilityfield id="can_sample_heap" since="9">
                                     10358         <description>
                                     10359           Can sample the heap.
                                     10360           If this capability is enabled then the heap sampling methods can be called.
                                     10361         </description>
                                     10362       </capabilityfield>

                                    Looks like this capability should
                                    not be "since 9" if it gets integrated
                                    now.

                                Updated now to 11, crossing my fingers :)

                                    
                                     src/hotspot/share/runtime/heapMonitoring.cpp:

                                       448       if (is_alive->do_object_b(value)) {
                                       449         // Update the oop to point to the new object if it is still alive.
                                       450         f->do_oop(&(trace.obj));
                                       451
                                       452         // Copy the old trace, if it is still live.
                                       453         _allocated_traces->at_put(curr_pos++, trace);
                                       454
                                       455         // Store the live trace in a cache, to be served up on /heapz.
                                       456         _traces_on_last_full_gc->append(trace);
                                       457
                                       458         count++;
                                       459       } else {
                                       460         // If the old trace is no longer live, add it to the list of
                                       461         // recently collected garbage.
                                       462         store_garbage_trace(trace);
                                       463       }

                                    In the case where the oop was not
                                    live, I would like it to be explicitly
                                    cleared.
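
                                    Something like this is what I have in mind
                                    (just a sketch; whether you clear it with a
                                    plain store or through the Access API is up
                                    to you):

                                      } else {
                                        // Explicitly clear the dead reference
                                        // before recycling the trace.
                                        trace.obj = NULL;
                                        store_garbage_trace(trace);
                                      }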

                                Done, I think, the way you wanted it. Let me
                                know, because I'm not familiar
                                with the RootAccess API. I'm unclear whether
                                I'm doing this right or not, so
                                reviews of these parts are highly
                                appreciated. Robbin had talked of
                                perhaps later pushing this all into an
                                OopStorage; should I do this now,
                                do you think? Or can that wait for a second
                                webrev later down the road?

                            I think using handles can and should be done
                            later. You can use the Access API now.
                            I noticed that you are missing an #include
                            "oops/access.inline.hpp" in your
                            heapMonitoring.cpp file.

                        The missing header is there for me, so I'm not
                        sure what happened; I made sure it is present in
                        the latest webrev. Sorry about that.

                                + Did I clear it the way you wanted me
                                to or were you thinking of
                                something else?


                            That is precisely how I wanted it to be
                            cleared. Thanks.

                                + Final question here, seems like if I
                                were to want to not do the
                                f->do_oop directly on the trace.obj, I'd
                                need to do something like:

                                     f->do_oop(&value);
                                     ...
                                     trace->store_oop(value);

                                to update the oop internally. Is that
                                right/is that one of the
                                advantages of going to the Oopstorage
                                sooner rather than later?


                            I think you really want to do the do_oop on
                            the root directly. Is there a particular
                            reason why you would not want to do that?
                            Otherwise, yes - the benefit with using the
                            handle approach is that you do not need to
                            call do_oop explicitly in your code.

                        There is no reason except that now we have a
                        load_oop and a get_oop_addr; I was not sure what
                        you would think of that.

                    That's fine.

                                    Also I see a lot of
                                    concurrent-looking use of the
                                    following field:
                                       267   volatile bool _initialized;

                                    Please note that the "volatile"
                                    qualifier does not help with reordering
                                    here. Reordering between volatile
                                    and non-volatile fields is
                                    completely free
                                    for both compiler and hardware,
                                    except for windows with MSVC, where
                                    volatile
                                    semantics is defined to use
                                    acquire/release semantics, and the
                                    hardware is
                                    TSO. But for the general case, I
                                    would expect this field to be stored
                                    with
                                    OrderAccess::release_store and
                                    loaded with OrderAccess::load_acquire.
                                    Otherwise it is not thread safe.

                                Because everything is behind a mutex, I
                                wasn't really worried about
                                this. I have a test that has multiple
                                threads trying to hit this
                                corner case and it passes.

                                However, to be paranoid, I updated it to
                                use the OrderAccess API
                                now, thanks! Let me know what you think
                                there too!


                            If it is indeed always supposed to be read
                            and written under a mutex, then I would
                            strongly prefer to have it accessed as a
                            normal non-volatile member, and have an
                            assertion that given lock is held or we are
                            in a safepoint, as we do in many other
                            places. Something like this:

                            assert(HeapMonitorStorage_lock->owned_by_self()
                            || (SafepointSynchronize::is_at_safepoint()
                            && Thread::current()->is_VM_thread()), "this
                            should not be accessed concurrently");

                            It would be confusing to people reading the
                            code if there are uses of OrderAccess that
                            are actually always protected under a mutex.

                        Thank you for the exact example to be put in the
                        code! I put it around each access/assignment of
                        the _initialized field and found one case where
                        yes you can touch it and not have the lock. It
                        actually is "ok" because you don't act on the
                        storage until later and only when you really
                        want to modify the storage (see the
                        object_alloc_do_sample method which calls the
                        add_trace method).

                        But, because of this, I'm going to put the
                        OrderAccess here. I'll do some performance
                        numbers later and, if there are issues, I might
                        add an "unsafe" read and a "safe" one to make it
                        explicit to the reader (see the sketch below). But
                        I don't think it will come to that.
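
                        Roughly what I mean by that split (an illustrative
                        sketch, not code from the webrev):

                            // "Safe" reader: may be called without the lock.
                            bool initialized_acquire() {
                              return OrderAccess::load_acquire(&_initialized);
                            }

                            // "Unsafe" reader: only valid under HeapMonitorStorage_lock.
                            bool initialized() {
                              assert(HeapMonitorStorage_lock->owned_by_self(), "must hold the lock");
                              return _initialized;
                            }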


                    Okay. This double return in heapMonitoring.cpp looks
                    wrong:

                      283   bool initialized() {
                      284     return OrderAccess::load_acquire(&_initialized) != 0;
                      285     return _initialized;
                      286   }

                    Since you said object_alloc_do_sample() is the only
                    place where you do not hold the mutex while reading
                    initialized(), I had a closer look at that. It looks
                    like in its current shape, the lack of a mutex may
                    lead to a memory leak. In particular, it first
                    checks if (initialized()). Let's assume this is now
                    true. It then allocates a bunch of stuff, and checks
                    if the number of frames were over 0. If they were,
                    it calls StackTraceStorage::storage()->add_trace()
                    seemingly hoping that after grabbing the lock in
                    there, initialized() will still return true. But it
                    could now return false and skip doing anything, in
                    which case the allocated stuff will never be freed.

                I fixed this now by making add_trace return a boolean
                and checking for that (roughly as sketched below). It
                will be in the next webrev. Thanks; the truth is that
                in our implementation the system is always on or off,
                so this never really occurs :). In this version though,
                that is not true and it's important to handle, so
                thanks again!
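
                Roughly, the check looks like this (a sketch only; the
                exact add_trace signature and the cleanup calls in the
                real code differ):

                    // In object_alloc_do_sample(), after collecting the frames:
                    if (!StackTraceStorage::storage()->add_trace(trace)) {
                      // Storage was disabled concurrently; free what was
                      // allocated for this sample instead of leaking it.
                      FREE_C_HEAP_ARRAY(jvmtiFrameInfo, frames);
                    }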


                    So the analysis seems to be that _initialized is
                    only used outside of the mutex in one instance,
                    where it is used to perform double-checked locking,
                    that actually causes a memory leak.

                    I am not proposing how to fix that, just raising the
                    issue. If you still want to perform this
                    double-checked locking somehow, then the use of
                    acquire/release still seems odd. Because the memory
                    ordering restrictions of it never comes into play in
                    this particular case. If it ever did, then the use
                    of destroy_stuff(); release_store(_initialized, 0)
                    would be broken anyway as that would imply that
                    whatever concurrent reader there ever was would
                    after reading _initialized with load_acquire() could
                    *never* read the data that is concurrently destroyed
                    anyway. I would be biased to think that
                    RawAccess<MO_RELAXED>::load/store looks like a more
                    appropriate solution, given that the memory leak
                    issue is resolved. I do not know how painful it
                    would be to not perform this double-checked locking.

                So I agree with this entirely. I also looked a bit
                more, and the difference in the code really stems from
                our internal version. In this version, however, there
                are actually a lot of things going on that I had not
                gone through entirely in my head, but this comment made
                me ponder it a bit more.

                Since every object_alloc_do_sample is protected by a
                check to HeapMonitoring::enabled(), there is only a
                small chance that the call is happening when things have
                been disabled. So there is no real need to do a first
                check on initialized; it is a rare occurrence that a
                call happens to object_alloc_do_sample and the
                storage's initialized returns false.

                (By the way, even if you did call object_alloc_do_sample
                without looking at HeapMonitoring::enabled(), that would
                be ok too. You would gather the stacktrace and get
                nowhere at the add_trace call, which would return false;
                so though not optimal performance wise, nothing would
                break).

                Furthermore, the add_trace is really the moment of no
                return and we have the mutex lock and then the
                initialized check. So, in the end, I did two things: I
                removed that first check and then I removed the
                OrderAccess for the storage initialized. I think now I
                have a better grasp and understanding why it was done in
                our code and why it is not needed here. Thanks for
                pointing it out :). This now still passes my JTREG
                tests, especially the threaded one.



                                    As a kind of meta comment, I wonder
                                    if it would make sense to add sampling
                                    for non-TLAB allocations. Seems like
                                    if someone is rapidly allocating a
                                    whole bunch of 1 MB objects that
                                    never fit in a TLAB, I might still be
                                    interested in seeing that in my
                                    traces, and not get surprised that the
                                    allocation rate is very high yet not
                                    showing up in any profiles.

                                That is handled by handle_sample,
                                where you wanted me to put a
                                UseTLAB check, because you hit that case
                                if the allocation is too big.


                            I see. It was not obvious to me that
                            non-TLAB sampling is done in the TLAB class.
                            That seems like an abstraction crime.
                            What I wanted in my previous comment was
                            that we do not call into the TLAB when we
                            are not using TLABs. If there is sampling
                            logic in the TLAB that is used for something
                            else than TLABs, then it seems like that
                            logic simply does not belong inside of the
                            TLAB. It should be moved out of the TLAB,
                            and instead have the TLAB call this common
                            abstraction that makes sense.

                        So in the incremental version:

                        
                        http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/,
                        this is still a "crime". The reason is that the
                        system has to have the bytes_until_sample on a
                        per-thread level and it made "sense" to have it
                        with the TLAB implementation. Also, I was not
                        sure how people felt about adding something to
                        the thread instance instead.

                        Do you think it fits better at the Thread level?
                        I can see how difficult it is to make it happen
                        there and add some logic there. Let me know what
                        you think.


                    We have an unfortunate situation where everyone that
                    has some fields that are thread local tends to dump
                    them right into Thread, making the size and
                    complexity of Thread grow as it becomes tightly
                    coupled with various unrelated subsystems. It would
                    be desirable to have a separate class for this
                    instead that encapsulates the sampling logic. That
                    class could possibly reside in Thread though as a
                    value object of Thread.

                I imagined that would be the case but was not sure. I
                will look at the example that Robbin is talking about
                (ThreadSMR) and will see how to refactor my code to use
                that.

                Thanks again for your help,

                Jc



                            Hope I have answered your questions and that
                            my feedback makes sense to you.

                        You have and thank you for them, I think we are
                        getting to a cleaner implementation and things
                        are getting better and more readable :)


                    Yes it is getting better.

                    Thanks,
                    /Erik



                        Thanks for your help!

                        Jc

                            Thanks,
                            /Erik

                                I double checked by changing the test
                                
                                http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java

                                to use a smaller TLAB (2048) and a bigger
                                object, and it goes through that path and
                                passes.

                                Thanks again for your review and I look
                                forward to your pointers for
                                the questions I now have raised!
                                Jc







                                    Thanks,
                                    /Erik


                                    On 2018-01-26 06:45, JC Beyler wrote:

                                        Thanks Robbin for the reviews :)

                                        The new full webrev is here:
                                        
                                        http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/
                                        The incremental webrev is here:
                                        http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/

                                        I inlined my answers:

                                        On Thu, Jan 25, 2018 at 1:15 AM,
                                        Robbin Ehn
                                        <robbin....@oracle.com
                                        <mailto:robbin....@oracle.com>>
                                        wrote:

                                            Hi JC, great to see another
                                            revision!

                                            ####
                                            heapMonitoring.cpp

                                            StackTraceData should not
                                            contain the oop for 'safety'
                                            reasons.
                                            When StackTraceData is moved
                                            from _allocated_traces:
                                            L452 store_garbage_trace(trace);
                                            it contains a dead oop.
                                            _allocated_traces could
                                             instead be a tuple of oop
                                             and StackTraceData; thus
                                             dead oops are not kept.

                                        Done. I used inheritance to make
                                        the copier work regardless but the
                                        idea is the same.

                                             You should use the new Access API
                                             for loading the oop, something
                                             like this:
                                             RootAccess<ON_PHANTOM_OOP_REF |
                                             AS_NO_KEEPALIVE>::load(...)
                                             I don't think you need to use the
                                             Access API for clearing the oop,
                                             but it would look nicer. And you
                                             probably shouldn't be using:
                                             Universe::heap()->is_in_reserved(value)

                                        I am unfamiliar with this, but I
                                        think I did do it like you wanted me
                                        to (all tests pass, so that's a
                                        start). I'm not sure how to clear the
                                        oop exactly; is there somewhere
                                        that does that, which I can use to do
                                        the same?

                                        I removed the is_in_reserved; this
                                        came from our internal version, and I
                                        don't know why it was there, but my
                                        tests work without it, so I removed
                                        it :)

                                            The lock:
                                            L424   MutexLocker
                                            mu(HeapMonitorStorage_lock);
                                             is not needed as far as I
                                             can see.
                                             weak_oops_do is called at a
                                             safepoint, no TLAB
                                             allocation can happen, and the
                                             JVMTI thread can't access
                                             these data structures. Is
                                            there something more
                                            to
                                            this lock that I'm missing?

                                        Since a thread can call the
                                        JVMTI getLiveTraces (or any of
                                        the other
                                        ones), it can get to the point
                                        of trying to copy the
                                        _allocated_traces. I imagine it
                                        is possible that this is happening
                                        during a GC or that it can be
                                        started and a GC happens afterwards.
                                        Therefore, it seems to me that
                                        you want this protected, no?

                                            ####
                                            You have 6 files without any
                                            changes in them (any more):
                                            g1CollectedHeap.cpp
                                            psMarkSweep.cpp
                                            psParallelCompact.cpp
                                            genCollectedHeap.cpp
                                            referenceProcessor.cpp
                                            thread.hpp

                                        Done.

                                            ####
                                            I have not looked closely,
                                            but is it possible to hide
                                            heap sampling in
                                            AllocTracer ? (with some
                                            minor changes to the
                                            AllocTracer API)

                                        I am imagining that you are
                                        saying to move the code that
                                        does the
                                        sampling (change the tlab
                                        end, do the call to HeapMonitoring,
                                        etc.) into the AllocTracer code
                                        itself? I think that is right
                                        and I'll
                                        look if that is possible and
                                        prepare a webrev to show what
                                        would be
                                        needed to make that happen.

                                            ####
                                             Minor nit: when declaring pointers
                                             there is a little mix of having
                                             the pointer adjacent to the type
                                             name or to the variable name.
                                             (Most hotspot code puts it by the
                                             type name.)
                                             E.g.
                                             heapMonitoring.cpp:711  jvmtiStackTrace *trace = ....
                                             heapMonitoring.cpp:733  Method* m = vfst.method();
                                             (not just this file)

                                        Done!

                                            ####
                                            HeapMonitorThreadOnOffTest.java:77
                                            I would make g_tmp volatile,
                                             otherwise the assignment in
                                             the loop may
                                             theoretically be skipped.

                                        Also done!

                                        Thanks again!
                                        Jc
