Which String deduplication option?

Christopher Mon, 01 Feb 2021 18:55:21 -0800

While reviewing code, I saw that
core/src/main/java/org/apache/accumulo/core/clientImpl/TabletLocator.java
was using a WeakHashMap to deduplicate some strings.

This code can probably be removed in favor of one of the following two options:

1. Just explicitly use String.intern() - As of Java 7, interned strings
are no longer stored in the separate, fixed-size PermGen space (which
was removed entirely in Java 8), so they live in the main heap and are
no longer constrained to a limited-size pool. These strings are still
subject to garbage collection. The intern pool is implemented as a
hash table internally (native implementation), with a default of more
than 60K buckets, plenty for the interning that TabletLocator is
doing... but the user can change this with the -XX:StringTableSize JVM
flag if it's not. Interning will use less memory than a WeakHashMap
and perform similarly, as long as the table has enough buckets.

2. Just use the -XX:+UseStringDeduplication JVM flag - As of Java 9, G1
is the default Java garbage collector. This garbage collector has the
option to automatically attempt to deduplicate all strings behind the
scenes, by making equal strings share one underlying character array.
It won't affect == equality, because the String object references
themselves don't change (unlike option 1, which hands back a single
canonical reference). This is more passive than option 1, but would
apply to the entire JVM. G1GC also implements some heuristics to
prevent too much overhead.
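
To make the difference in option 1 concrete, here is a minimal sketch
comparing a WeakHashMap-based deduplicator with String.intern(). The
class and method names are hypothetical, and the WeakHashMap helper is
only a stand-in for this kind of cache, not the actual TabletLocator
code.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

public class StringDedupSketch {

    // Hypothetical cache: weak keys, and weak values so that the value
    // does not keep its own key strongly reachable.
    private static final Map<String, WeakReference<String>> CACHE = new WeakHashMap<>();

    // Returns the first equal String seen that is still reachable.
    public static synchronized String dedupWithWeakMap(String s) {
        WeakReference<String> ref = CACHE.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical == null) {
            CACHE.put(s, new WeakReference<>(s));
            canonical = s;
        }
        return canonical;
    }

    // Option 1: let the JVM's intern pool pick the canonical instance.
    public static String dedupWithIntern(String s) {
        return s.intern();
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal contents.
        String a = new String("tablet-server-1:9997");
        String b = new String("tablet-server-1:9997");

        System.out.println(a == b);                                     // false
        System.out.println(dedupWithWeakMap(a) == dedupWithWeakMap(b)); // true
        System.out.println(dedupWithIntern(a) == dedupWithIntern(b));   // true
    }
}
```

Both helpers return one shared reference for equal strings, which is
why callers can drop their duplicates. Option 2 does not do this: the
two String objects stay distinct and only their backing storage is
shared.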

With both options, it's possible to output statistics.

If I remove the WeakHashMap for the string deduplication in
TabletLocator, does anybody have an opinion on which option I should
replace it with? I'm leaning towards option 2 (adding it to
assemble/conf/accumulo-env.sh as one of the default flags).
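
If option 2 goes in as a default flag, a process could confirm at
startup that the flag actually reached the JVM. The following is only
a sketch under that assumption; DedupFlagCheck and dedupRequested are
hypothetical names. It reports only whether the flag was passed on the
command line, not whether the collector in use honors it.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class DedupFlagCheck {

    // Scans JVM arguments in order; a later +/- occurrence overrides an
    // earlier one, so the last setting seen is the one reported.
    public static boolean dedupRequested(List<String> jvmArgs) {
        boolean enabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+UseStringDeduplication")) {
                enabled = true;
            } else if (arg.equals("-XX:-UseStringDeduplication")) {
                enabled = false;
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with (not program args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("String deduplication requested: " + dedupRequested(jvmArgs));
    }
}
```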
