Re: Which String deduplication option?

David Mon, 08 Feb 2021 13:31:35 -0800

Guava argues for the use of a weak hashmap.

https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/Interner.java#L28-L30


On Mon, Feb 8, 2021 at 3:57 PM Brian Loss <brianl...@gmail.com> wrote:

> It might make sense to do both approaches.
>
> It seems there are limits to when -XX:+UseStringDeduplication takes
> effect. By default, it only interns objects that have survived 3 GC cycles,
> although that number can be changed. If the objects in question are
> short-lived, then it wouldn’t make sense to call String.intern on them
> either. However, if we know based on the usage pattern that we’d get a lot
> of deduplication on long-lived strings, then String.intern is better
> because it will happen right away and will also save more memory since the
> String object itself is de-duped (vs just the internal char array for the
> automatic de-duplication). It wasn’t completely clear from my reading, but
> if I understood correctly the other potential downside to
> UseStringDeduplication is that it happens after GC if there’s time. On a
> heavily loaded system that doesn’t time left in the pause time goal window
> after completing GC, the string de-duplication might not happen at all.
>
> Adding -XX:+UseStringDeduplication wouldn’t hurt and could potentially
> provide some benefit, so I’d be in favor of adding it. For TabletLocator
> specifically, if we know that’s an area where string de-duplication will
> help, then we should probably use String.intern there. As Keith suggested,
> a stress test might help determine whether it makes sense. In the absence
> of that, if we assume the previous WeakHashMap was there to solve a
> specific problem (vs an uninformed attempt to save memory) then
> String.intern sounds to me like the way to go as well.
>
> > On Feb 8, 2021, at 3:34 PM, Dave Marion <dmario...@gmail.com> wrote:
> >
> > String.intern() would seem to provide better coverage considering that
> some
> > users may not use the G1 collector.
> >
> > On Mon, Feb 8, 2021 at 3:19 PM Keith Turner <ke...@deenlo.com> wrote:
> >
> >> Recently while running some large map reduce jobs I learned that
> >> Hadoop uses String.intern() in its RPC code (below is a link to an
> >> example on one place where Hadoop does this).  I learned this because
> >> when I ran jstack on NN, RM, and/or AM that were under distress
> >> sometimes I kept seeing RPC server threads that were in
> >> String.intern().  I never was quite sure if it was a problem though.
> >> Not saying String.intern() is bad or good, just sharing something I
> >> observed that I was uncertain about.
> >>
> >> May make sense to create some sort of stress test that could simulate
> >> the usage pattern of the TabletLocator and try the different options
> >> and see what happens.  If any long pauses or problems happen in the
> >> simulation, they may happen in the real environment.
> >>
> >>
> >>
> https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/TaskStatus.java#L481
> >>
> >>
> https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringInterner.java#L67
> >>
> >> On Mon, Feb 1, 2021 at 9:55 PM Christopher <ctubb...@apache.org> wrote:
> >>>
> >>> While code reviewing, I saw that
> >>>
> core/src/main/java/org/apache/accumulo/core/clientImpl/TabletLocator.java
> >>> was using a WeakHashMap to deduplicate some strings.
> >>>
> >>> This code can probably be removed in favor of one of the following two
> >> options:
> >>>
> >>> 1. Just explicitly use String.intern() - As of Java 7, there is no
> >>> longer a separate, fixed-size PermGen space, so intern'd strings will
> >>> be in the main heap, no longer constrained to a limited size pool.
> >>> These strings are still subject to garbage collection. It is
> >>> implemented as a HashMap internally (native implementation), with a
> >>> default bucket size of more than 60K, plenty big enough for the
> >>> interning that TabletLocator is doing... but this is configurable by
> >>> the user with JVM flags if it's not. Interning will use less memory as
> >>> WeakHashMap and similar performance, as long as the bucket size is big
> >>> enough.
> >>>
> >>> 2. Just use -XX:+UseStringDeduplication JVM flag - as of Java 9, G1 is
> >>> the new default Java garbage collector. This garbage collector has the
> >>> option to automatically attempt to deduplicate all strings behind the
> >>> scenes, by swapping out their underlying char arrays (so, it likely
> >>> won't affect == equality because the String object references
> >>> themselves won't change, unlike option 1). This is more passive than
> >>> option 1, but would apply to the entire JVM. G1GC also implements some
> >>> heuristics to prevent too much overhead.
> >>>
> >>> With both options, it's possible to output statistics.
> >>>
> >>> If I remove the WeakHashMap for the string deduplication in
> >>> TabletLocator, does anybody have an opinion on which option I should
> >>> replace it with? I'm leaning towards option 2 (adding it to
> >>> assemble/conf/accumulo-env.sh as one of the default flags).
> >>
>
>

Re: Which String deduplication option?

Reply via email to