Guava's argument in the linked comment appears to be based on pre-Java8, before the PermGen space was consolidated with the main heap and had a fixed size.
In response to the other observations: a stress test here seems particularly difficult to compare between String.intern and G1GC deduplication, since the latter will deduplicate across the JVM, and not just the TabletLocator stuff. So, we wouldn't be able to get a good direct comparison between the overall impact between the two. As for running a stress test between WeakHashMap and String.intern, just for TabletLocator, I'm not going to bother because others have already done that work and determined them to be similar in performance, given an adequate string table size (one is at http://java-performance.info/string-intern-in-java-6-7-8/) So, what I will do is create a PR to replace the non-tunable WeakHashMap with the tunable String.intern for TabletLocator, on the basis that they have comparable performance, but the latter is user-tunable if they need to, uses less memory, and involves less code. I will not do anything with the G1GC settings, leaving that up to users to experiment with and tune on their own, if they wish. On Mon, Feb 8, 2021 at 4:31 PM David <[email protected]> wrote: > > Guava argues for the use of a weak hashmap. > > https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/Interner.java#L28-L30 > > On Mon, Feb 8, 2021 at 3:57 PM Brian Loss <[email protected]> wrote: > > > It might make sense to do both approaches. > > > > It seems there are limits to when -XX:+UseStringDeduplication takes > > effect. By default, it only interns objects that have survived 3 GC cycles, > > although that number can be changed. If the objects in question are > > short-lived, then it wouldn’t make sense to call String.intern on them > > either. However, if we know based on the usage pattern that we’d get a lot > > of deduplication on long-lived strings, then String.intern is better > > because it will happen right away and will also save more memory since the > > String object itself is de-duped (vs just the internal char array for the > > automatic de-duplication). It wasn’t completely clear from my reading, but > > if I understood correctly the other potential downside to > > UseStringDeduplication is that it happens after GC if there’s time. On a > > heavily loaded system that doesn’t time left in the pause time goal window > > after completing GC, the string de-duplication might not happen at all. > > > > Adding -XX:+UseStringDeduplication wouldn’t hurt and could potentially > > provide some benefit, so I’d be in favor of adding it. For TabletLocator > > specifically, if we know that’s an area where string de-duplication will > > help, then we should probably use String.intern there. As Keith suggested, > > a stress test might help determine whether it makes sense. In the absence > > of that, if we assume the previous WeakHashMap was there to solve a > > specific problem (vs an uninformed attempt to save memory) then > > String.intern sounds to me like the way to go as well. > > > > > On Feb 8, 2021, at 3:34 PM, Dave Marion <[email protected]> wrote: > > > > > > String.intern() would seem to provide better coverage considering that > > some > > > users may not use the G1 collector. > > > > > > On Mon, Feb 8, 2021 at 3:19 PM Keith Turner <[email protected]> wrote: > > > > > >> Recently while running some large map reduce jobs I learned that > > >> Hadoop uses String.intern() in its RPC code (below is a link to an > > >> example on one place where Hadoop does this). I learned this because > > >> when I ran jstack on NN, RM, and/or AM that were under distress > > >> sometimes I kept seeing RPC server threads that were in > > >> String.intern(). I never was quite sure if it was a problem though. > > >> Not saying String.intern() is bad or good, just sharing something I > > >> observed that I was uncertain about. > > >> > > >> May make sense to create some sort of stress test that could simulate > > >> the usage pattern of the TabletLocator and try the different options > > >> and see what happens. If any long pauses or problems happen in the > > >> simulation, they may happen in the real environment. > > >> > > >> > > >> > > https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/TaskStatus.java#L481 > > >> > > >> > > https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringInterner.java#L67 > > >> > > >> On Mon, Feb 1, 2021 at 9:55 PM Christopher <[email protected]> wrote: > > >>> > > >>> While code reviewing, I saw that > > >>> > > core/src/main/java/org/apache/accumulo/core/clientImpl/TabletLocator.java > > >>> was using a WeakHashMap to deduplicate some strings. > > >>> > > >>> This code can probably be removed in favor of one of the following two > > >> options: > > >>> > > >>> 1. Just explicitly use String.intern() - As of Java 7, there is no > > >>> longer a separate, fixed-size PermGen space, so intern'd strings will > > >>> be in the main heap, no longer constrained to a limited size pool. > > >>> These strings are still subject to garbage collection. It is > > >>> implemented as a HashMap internally (native implementation), with a > > >>> default bucket size of more than 60K, plenty big enough for the > > >>> interning that TabletLocator is doing... but this is configurable by > > >>> the user with JVM flags if it's not. Interning will use less memory as > > >>> WeakHashMap and similar performance, as long as the bucket size is big > > >>> enough. > > >>> > > >>> 2. Just use -XX:+UseStringDeduplication JVM flag - as of Java 9, G1 is > > >>> the new default Java garbage collector. This garbage collector has the > > >>> option to automatically attempt to deduplicate all strings behind the > > >>> scenes, by swapping out their underlying char arrays (so, it likely > > >>> won't affect == equality because the String object references > > >>> themselves won't change, unlike option 1). This is more passive than > > >>> option 1, but would apply to the entire JVM. G1GC also implements some > > >>> heuristics to prevent too much overhead. > > >>> > > >>> With both options, it's possible to output statistics. > > >>> > > >>> If I remove the WeakHashMap for the string deduplication in > > >>> TabletLocator, does anybody have an opinion on which option I should > > >>> replace it with? I'm leaning towards option 2 (adding it to > > >>> assemble/conf/accumulo-env.sh as one of the default flags). > > >> > > > >
