On Thu, 24 Mar 2022 17:43:31 GMT, XenoAmess <[email protected]> wrote:
>> 8186958: Need method to create pre-sized HashMap
>
> XenoAmess has updated the pull request incrementally with two additional
> commits since the last revision:
>
> - update jmh
> - refine javadoc; refine implement when expectedSize < 0
OK, finally got some time to look at this. Here's a rewrite of the spec words,
at least for HashMap::newHashMap. If this settles down, I'll write the CSR for
this and LHM and WHM.
/**
* Creates a new, empty HashMap suitable for the expected number of
mappings.
* The returned map uses the default load factor of 0.75, and its initial
capacity is
* generally large enough so that the expected number of mappings can be
added
* without resizing the map.
*
* @param numMappings the expected number of mappings
* @param <K> the type of keys maintained by this map
* @param <V> the type of mapped values
* @return the newly created map
* @throws IllegalArgumentException if numMappings is negative
* @since 19
*/
The original wording was taken from CHM, which generally is a reasonable thing
to do. Unfortunately, it's wrong. :-) "Table size" isn't accurate; HashMap uses
"capacity" as the term for the number of buckets (== length of the internal
table array). "Size" means "number of mappings" so its use of "table size"
confuses these two concepts. The CHM wording also uses "elements" which applies
to linear collections; the things inside a map are usually referred to as
"mappings" or "entries". (I guess we should fix up CHM at some point too.)
While "expectedSize" isn't inaccurate, it's not tied to the main text, so I've
renamed it to numMappings.
There are a couple other javadoc style rules operating here. The first sentence
is generally a sentence fragment that is short and descriptive, as it will be
pulled into a summary table. (It's often written as if it were a sentence that
begins "This method..." but those words are elided.) Details are in the rest of
the first paragraph. The text for `@param`, `@return`, and `@throws` are
generally also sentence fragments, and they have no initial capital letter nor
trailing period.
--
On performance and benchmarking: this is a distraction from the main point of
this effort, which is to add an API that allows callers a correct and
convenient way to create a properly-sized HashMap. Any work spent on optimizing
performance is almost certainly wasted.
First, the benchmark: it's entirely unclear what this is measuring. It's
performing the operation 2^31 times, but it sends the result to a black hole so
that the JIT doesn't eliminate the computation. One of the actual results is
0.170 ops/sec. This includes both the operation and the black hole, so we don't
actually have any idea what that result represents. _Maybe_ it's possible to
infer some idea of relative performance of the different operations by
comparing the results. It's certainly plausible that the integer code is faster
than the float or double code. But the benchmark doesn't tell us how long the
actual computation takes.
Second, how significant is the cost of the computation? I'll assert that it's
insignificant. The table length is computed once at HashMap creation time, and
it's amortized over the addition of all the initial entries and subsequent
queries and updates to the HashMap. Any of the computations (whether integer or
floating point) are a handful of nanoseconds. This will be swamped by the first
hashCode computation that causes a cache miss.
Third: I'll stipulate that the integer computation is probably a few ns faster
than the floating-point computation. But the computation is entirely
non-obvious. To make up for it being non-obvious, there's a big comment that
explains it all. It's not worth doing something that increases performance by
an insignificant amount that also requires a big explanation.
Finally, note that most of the callers are already doing a floating-point
computation to compute the desired capacity, and it doesn't seem to be a
problem.
Sorry, you probably spent a bunch of time on this already, but trying to
optimize the performance here just isn't worthwhile. Let's please just stick
with our good old `(int) Math.ceil(numMappings / 0.75)`.
--
There should be regression tests added for the three new methods. I haven't
thought much about this. It might be possible to reuse some of the
infrastructure in the WhiteBoxResizeTest we worked on previously.
--
I think it would be good to include updates to some of the use sites in this
PR. It's often useful to put new APIs into practice. One usually learns
something from the effort. Even though this is a really simple API, looking at
use sites can illuminating, to see how the code reads. This might affect the
method name, for example.
You don't need to update all of the use sites in the JDK, but it would be good
to choose a reasonable sample. Maybe the ones from a single package, or a
handful (like java.lang or java.util). Maybe include
Class::enumConstantDirectory. If this use site is updated, then maybe it will
allow us to get rid of that ConstantDirectoryOptimalCapacity test that we
problem-listed in the previous PR.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7928