On Fri, 27 Sep 2024 09:52:07 GMT, Afshin Zafari <[email protected]> wrote:
> > What is the consensus on having two implementations alive at the same time?
> > I'd like to see us delete the old VirtualMemoryTracker and only have the
> > tree implementation left as part of this PR. What is the status of our
> > testing? Is the tree version equivalent now and passes all tests?
>
> I have already said that the main reason to have two versions is to be able to
> switch back from the new to the old version at a customer site at runtime. This
> is what Thomas has requested.
The issue with having two implementations is that it requires adding a new
`java` option, and deprecating and removing an option takes 3 releases. That's 18
months of us being required to have the old version in the codebase, and being
required to maintain it. I don't think that's a negligible promise to make.
Rather than having two implementations, I'd like to see us aim for integration
in JDK 25 after forking 24, so integration in December. That would give us 6
months of ensuring stability of the new implementation before it reaches any
customers. During those 6 months it will go through all of our testing multiple
times over, and be used by OpenJDK developers. What do you think of this idea?
> During the implementation, it is very beneficial to have access to the old
> version to compare results and debug problems when they occur. Until the whole
> port to VMATree is done, we need this two-versions-side-by-side feature.
Sure, I can understand that it's nice to have both versions present during
development. Right now it seems like we have a broken build; do you have a plan
for getting to a functioning and fully featured build soon?
>> src/hotspot/share/nmt/regionsTree.hpp line 31:
>>
>>> 29: #include "nmt/vmatree.hpp"
>>> 30:
>>> 31: class RegionsTree : public VMATree {
>>
>> Is it necessary for this to be a superclass? Seems like this class contains
>> utilities for working with a `VMATree`. Could it instead be `AllStatic` and
>> take a tree as its first argument? I'd like to avoid inheritance unless
>> necessary (typically for the usage of virtual functions).
>
> It is inherited because it is not a new type. I think that
> `_tree->visit_reserved_region(par1, par2)` is more readable than
> `VMATree::visit_reserved_region(_tree, par1, par2)`.
> I could not find any *virtual functions* in these classes; what do you mean
> by that?
I'm saying that inheritance is mostly useful when we have virtual functions
which we can specialize, and that `VMATree` has none.
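To make that concrete, here's a rough sketch of the composition alternative (just
a shape, with the parameter list and the other members elided; not a patch):

```c++
#include "nmt/vmatree.hpp"

// Sketch only: RegionsTree owns a VMATree instead of inheriting from it.
class RegionsTree {
 private:
  VMATree _tree;      // held by value; RegionsTree is-not-a VMATree

 public:
  RegionsTree() : _tree() { }

  // Utilities layered on top of the tree still read naturally at the call
  // site, e.g. regions_tree.visit_reserved_region(par1, par2). The parameter
  // list is elided because it is not shown in this thread.
  // void visit_reserved_region(...);

  VMATree& tree() { return _tree; }  // access for anything not wrapped
};
```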
>> src/hotspot/share/nmt/regionsTree.hpp line 33:
>>
>>> 31: class RegionsTree : public VMATree {
>>> 32: private:
>>> 33: NativeCallStackStorage* _ncs_storage;
>>
>> No need for this to be a pointer.
>
> How do I use the ctor with a `bool` arg, then?
The bool argument is just passed along.
```c++
RegionsTree(bool with_storage) : VMATree(), _ncs_storage(with_storage) {
}
```
>> src/hotspot/share/nmt/regionsTree.hpp line 108:
>>
>>> 106: CommittedMemoryRegion cmr((address)base, comm_size, st);
>>> 107: comm_size = 0;
>>> 108: prev.clear_node();
>>
>> This is unneeded.
>
> Which part? Why?
> `clear_node()` is the same as `prev = nullptr` if pointers were used.
> `is_valid()` is the same as `prev == nullptr` if pointers were used.
Because you immediately reassign `prev` to `curr` on line 114, it's not necessary
to first call `clear_node`.
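As a toy illustration (hypothetical types, not the actual code in the PR), the
loop shape I mean is:

```c++
#include <vector>

// Toy sketch only: 'prev' is overwritten at the end of every iteration, so
// clearing it right before that assignment has no observable effect.
struct Node { int value = 0; bool valid = false; };

static void visit_committed(const std::vector<Node>& nodes) {
  Node prev;                       // starts out invalid
  for (const Node& curr : nodes) {
    if (prev.valid) {
      // ... build a CommittedMemoryRegion from prev and curr ...
      // prev = Node();            // <- unnecessary: prev is reassigned below
    }
    prev = curr;                   // the reassignment that makes a reset redundant
  }
}
```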
>> src/hotspot/share/nmt/virtualMemoryTrackerWithTree.cpp line 101:
>>
>>> 99: }
>>> 100:
>>> 101: void VirtualMemoryTrackerWithTree::apply_summary_diff(VMATree::SummaryDiff diff) {
>>
>> Applying a summary diff in the MemoryFileTracker is this:
>>
>> ```c++
>> for (int i = 0; i < mt_number_of_types; i++) {
>>   VirtualMemory* summary = file->_summary.by_type(NMTUtil::index_to_flag(i));
>>   summary->reserve_memory(diff.flag[i].commit);
>>   summary->commit_memory(diff.flag[i].commit);
>> }
>> ```
>>
>> This is much simpler and doesn't require looking at signs and so on.
>
> In `MemoryFileTracker`, the changes in commit/reserve are applied to a local
> `VirtualMemorySnapshot`. Here we have to apply them to the global
> `VirtualMemorySummary`.
Yeah, that doesn't seem like a problem.
```c++
for (int i = 0; i < mt_number_of_types; i++) {
  int64_t r = diff.flag[i].reserve;
  int64_t c = diff.flag[i].commit;
  MEMFLAGS flag = NMTUtil::index_to_flag(i);
  VirtualMemory* mem = VirtualMemorySummary::as_snapshot()->by_type(flag);
  size_t reserved = mem->reserved();
  size_t committed = mem->committed();
  mem->reserve_memory(r);
  mem->commit_memory(c);
  if ((size_t)-r > reserved) {
    print_err("release");   // print_err: placeholder for whatever reporting we want
  }
  if ((size_t)-c > committed) {
    print_err("uncommit");
  }
}
```
>> src/hotspot/share/nmt/virtualMemoryTrackerWithTree.hpp line 34:
>>
>>> 32: #include "utilities/ostream.hpp"
>>> 33:
>>> 34: class VirtualMemoryTrackerWithTree : AllStatic {
>>
>> Can this class be redone to use the same pattern as `MemoryFileTracker`, with
>> an outer class that's not static and an inner class that is static, storing an
>> instance of the outer class? That makes it possible to test the class by
>> creating an instance of it, unlike our old tests.
>
> Can we do this change as a separate RFE? It impacts the code a lot.
Then just invert it: Have the outer class be static and the inner class be an
instance. We can change the `MemoryFileTracker` to be that, as it's not as
large of a change.
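Something along these lines, as a sketch with hypothetical names rather than a
concrete proposal:

```c++
#include "memory/allocation.hpp"

// Sketch only: the state and operations live in an instance class that a gtest
// can construct directly; the AllStatic outer class only owns the one global
// instance the rest of the VM uses.
class VirtualMemoryTrackerWithTree : AllStatic {
 public:
  class Tracker : public CHeapObj<mtNMT> {   // hypothetical name for the instance part
   public:
    explicit Tracker(bool is_detailed_mode);
    // ... reserve/commit/release operations as instance methods ...
  };

  static void initialize(bool is_detailed_mode) {
    _tracker = new Tracker(is_detailed_mode);
  }
  static Tracker* tracker() { return _tracker; }

 private:
  static Tracker* _tracker;  // defined in the .cpp
};
```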
>> src/hotspot/share/nmt/vmatree.hpp line 59:
>>
>>> 57: // Bit fields view: bit 0 for Reserved, bit 1 for Committed.
>>> 58: // Setting a region as Committed preserves the Reserved state.
>>> 59: enum class StateType : uint8_t { Reserved = 1, Committed = 3, Released = 0, COUNT = 4 };
>>
>> Why do we now consider `StateType` to be a bit field?
>
> As the comment above it says, and as we already discussed: the 'Committed'
> state now means 'Reserved and Committed'. So, when we visit nodes and check
> for Reserved ones, the Committed nodes are also considered Reserved.
> Otherwise, we would ignore the Committed nodes and continue looking for
> Reserved nodes.
Right, I don't remember the discussion covering this being a bit field. I agree,
though, that Committed also means Reserved; it's just not necessary to express
that as a bit field. Regardless, I leave it up to you how we express this.
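As one possible alternative (only a sketch, not a request for these exact
values), the implication could live in a query helper instead of in the enum's
bit pattern:

```c++
#include <cstdint>

// Sketch only: "Committed implies Reserved" expressed as a predicate rather
// than as overlapping bits in the enum values.
enum class StateType : uint8_t { Released, Reserved, Committed, COUNT };

static constexpr bool covers_reserved(StateType st) {
  // A committed region is, by definition, also reserved.
  return st == StateType::Reserved || st == StateType::Committed;
}

static_assert(covers_reserved(StateType::Committed), "Committed implies Reserved");
static_assert(!covers_reserved(StateType::Released), "Released is neither");
```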
>> test/hotspot/gtest/nmt/test_nmt_treap.cpp line 168:
>>
>>> 166: tty->print_cr("d1: %lf, d2: %lf, d2/d1: %lf", d1, d2, d2 / d1);
>>> 167: EXPECT_LT((int)(d2 / d1), N2 / N1);
>>> 168: }
>>
>> 1000 and 10_000 elements are both quite small; isn't a single measurement
>> prone to being slower than expected due to unforeseen circumstances? For
>> example, if `malloc` performs a heavy allocation (by mapping in a few new
>> pages) during the 10_000-element insertion, then that may stall enough that
>> the `EXPECT_LT` fails. We are also limited to performing the test only once.
>>
>> Is your thinking that these perf tests should be kept in, or should they be
>> removed before integrating?
>
> Generally, I think it is necessary to have performance tests, to verify that
> future changes do not degrade the performance.
> The approach to stress testing can be agreed on between us, but the tests have
> to exist.
> In this approach, the input size is scaled N times and we expect the execution
> time to grow as O(log(N)), which is much less than O(N).
> If this test fails, we need to have a justification for it. If the approach is
> invalid or not correctly implemented, we fix it. But the expectation remains
> the same.
Why would the execution time grow only logarithmically when we do linearly more
work? When we run this with `N2`, we will perform roughly `10_000 * log(10_000, 2)`
units of work, and with `N1` roughly `1_000 * log(1_000, 2)` units. A closer
approximation is `O(log(1)) + O(log(2)) + ... + O(log(n))` for `n = N2` or
`n = N1`.
Let's calculate that:
```python
>>> import math
>>> def time_it(n):
...     t = 0
...     for i in range(1, n):
...         t = t + math.log(i)
...     return [t, 3*t]  # Approximate best and worst case
...
>>> time_it(1000)
[5905.220423209189, 17715.661269627566]
>>> time_it(10_000)
[82099.71749644217, 246299.15248932652]
>>> def compare(a, b):
...     ab, aw = a
...     bb, bw = b
...     return [ bb / ab, bw / aw ]
...
>>> compare(time_it(1000), time_it(10_000))
[13.902904821938064, 13.902904821938066]
>>> # Ouch, that's outside of our linear bound!
```
What I said previously also applies; in particular, I suspect that the
performance of `malloc` will impact this.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20425#issuecomment-2378929190
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1711105506
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703735921
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1799085027
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703751887
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703739805
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1776820787
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703877183