Barnabas Maidics created HIVE-20760:
---------------------------------------
Summary: Reducing memory overhead due to multiple HiveConfs
Key: HIVE-20760
URL: https://issues.apache.org/jira/browse/HIVE-20760
Project: Hive
Issue Type: Improvement
Components: Configuration
Reporter: Barnabas Maidics
Attachments: hiveconf_interned.html, hiveconf_original.html
The issue is that every Hive task has to load its own copy of {{HiveConf}}.
When running Hive on Spark (HoS) with a large number of cores per executor, a
significant amount of memory (~10%) is wasted due to this duplication.
I looked into the problem and found a way to reduce the overhead caused by the
multiple HiveConf objects.
I've created an implementation of Properties, somewhat similar to
CopyOnFirstWriteProperties. CopyOnFirstWriteProperties can't be used to solve
this problem, because it drops the interned Properties right after we add a new
property.
So my implementation looks like this:
* When we create a new HiveConf from an existing one (copy constructor), we
replace the properties object stored by HiveConf with the new Properties
implementation (HiveConfProperties). There are two possible ways to do this:
either we change the visibility of the properties field in the ancestor class
(Configuration, which comes from Hadoop) to protected, or, more simply, we
swap the object using reflection.
* HiveConfProperties immediately interns the given properties. After this,
every time we add a new property to HiveConf, we add it to an additional
Properties object. This way, if we create multiple HiveConfs with the same base
properties, they all share the same interned Properties object, but each
session/task can still add its own unique properties.
* Getting a property from HiveConfProperties would look like this (the
non-interned properties are stored in the superclass):
    String property = super.getProperty(key);
    if (property == null) property = interned.getProperty(key);
    return property;
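
To make the three points above concrete, here is a minimal sketch of how such a
class could look. It is not the code from my patch: the class name
HiveConfPropertiesSketch, the static POOL map and the internedBase() accessor
are hypothetical names chosen for illustration, and the pool holds strong
references, which a real implementation would probably want to avoid.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; names and pooling strategy are assumptions. */
class HiveConfPropertiesSketch extends Properties {
    // Hypothetical intern pool: one canonical Properties per distinct content.
    // Properties.equals/hashCode compare entries, so equal content maps to the
    // same key. The snapshots stored here are never mutated afterwards.
    private static final Map<Properties, Properties> POOL = new ConcurrentHashMap<>();

    // Shared, read-only base properties.
    private final Properties interned;

    HiveConfPropertiesSketch(Properties base) {
        // Copy the base so later changes to the caller's object cannot leak in.
        Properties snapshot = new Properties();
        snapshot.putAll(base);
        Properties existing = POOL.putIfAbsent(snapshot, snapshot);
        this.interned = (existing != null) ? existing : snapshot;
    }

    // Properties added later through setProperty/put land in the superclass
    // table, so they are looked up first and shadow the shared base.
    @Override
    public String getProperty(String key) {
        String property = super.getProperty(key);
        if (property == null) {
            property = interned.getProperty(key);
        }
        return property;
    }

    // Exposed only so the sharing can be observed in a test.
    Properties internedBase() {
        return interned;
    }
}
```

Two instances built from the same base share one interned object, while a
property set on one instance is not visible through the other. Note that the
sketch only overrides getProperty(String); methods such as size(),
stringPropertyNames() or remove() would also have to take the interned base
into account in a complete implementation.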
Some tests showed that the interning works (50 connections to HiveServer2,
heap dumps taken after the sessions were created for queries):
Overall memory:
original: 34,599K interned: 20,582K
Retained memory of HiveConfs:
original: 16,366K interned: 10,804K
I have attached the JXray reports for the heap dumps.
What are your thoughts about this solution?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)