[ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527045#comment-16527045 ]

Misha Dmitriev edited comment on HIVE-19937 at 6/29/18 2:24 AM:
----------------------------------------------------------------

I took a quick look, and I am not sure this is done correctly. The code below
{code:java}
jobConf.forEach(entry -> {
  StringInternUtils.internIfNotNull(entry.getKey());
  StringInternUtils.internIfNotNull(entry.getValue());
}){code}
iterates over each table entry and just invokes {{intern()}} on each key and value. 
{{intern()}} returns an existing, "canonical" string for each duplicate string, but 
the code never stores the returned strings back into the table, so the table's 
contents are unchanged. 
To intern both keys and values in a hashtable, you typically need to create a 
new table and effectively "intern and transfer" the contents from the old table 
to the new table. Sometimes it may be possible to be more creative and create a 
table with interned contents right away. Here that could probably be done by 
adding some custom Kryo deserialization code for such tables, but maybe that's 
too big an effort.
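To make the point concrete, here is a minimal, standalone sketch of the "intern and transfer" approach described above (the map contents and names are just for illustration, not taken from the patch):

```java
import java.util.HashMap;
import java.util.Map;

public class InternTransferDemo {

    // "Intern and transfer": build a new map whose keys and values are the
    // canonical interned instances, instead of interning in place (which is
    // a no-op, since the returned canonical strings are discarded).
    static Map<String, String> internMap(Map<String, String> src) {
        Map<String, String> result = new HashMap<>(src.size());
        for (Map.Entry<String, String> e : src.entrySet()) {
            String k = e.getKey() == null ? null : e.getKey().intern();
            String v = e.getValue() == null ? null : e.getValue().intern();
            result.put(k, v);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        // new String(...) forces a non-canonical copy of the literal
        m.put(new String("mapreduce.job.name"), new String("query1"));
        Map<String, String> interned = internMap(m);
        // After the transfer, the stored key IS the canonical instance:
        System.out.println(
            interned.keySet().iterator().next() == "mapreduce.job.name".intern());
        // true
    }
}
```

With a plain in-place {{forEach}} over the original map, the comparison above would print {{false}}, because the canonical strings returned by {{intern()}} never replace the copies stored in the table.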

As always, it would be good to measure how much memory was wasted before this 
change and how much is saved after it. This helps both to catch errors and to 
see how much was actually achieved.

If {{jobConf}} is an instance of {{java.lang.Properties}}, and there are many 
duplicates of such tables, then memory is wasted both by the string contents of 
these tables and by the tables themselves (each table uses many extra Java 
objects internally). So you may consider checking the 
{{org.apache.hadoop.hive.common.CopyOnFirstWriteProperties}} class that I once 
added for a somewhat similar use case.
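For illustration only, here is a toy sketch of the copy-on-first-write idea; this is not the actual {{CopyOnFirstWriteProperties}} implementation, just the general pattern it is based on: many readers share one backing table, and the first mutation triggers a private copy, so duplicate tables cost almost nothing until they actually diverge.

```java
import java.util.Properties;

public class CopyOnFirstWriteDemo {

    // Toy copy-on-first-write Properties (sketch; only setProperty/getProperty
    // are covered here, a real implementation must override all mutators).
    static class CowProperties extends Properties {
        private Properties shared; // read-only shared table; null after copy

        CowProperties(Properties shared) {
            this.shared = shared;
        }

        private synchronized void copyIfShared() {
            if (shared != null) {
                super.putAll(shared); // first write: take a private copy
                shared = null;
            }
        }

        @Override
        public synchronized Object setProperty(String key, String value) {
            copyIfShared();
            return super.setProperty(key, value);
        }

        @Override
        public String getProperty(String key) {
            Properties s = shared;
            return (s != null) ? s.getProperty(key) : super.getProperty(key);
        }
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("hive.exec.parallel", "true");

        CowProperties a = new CowProperties(base); // shares base, no copy
        CowProperties b = new CowProperties(base); // shares base, no copy

        b.setProperty("hive.exec.parallel", "false"); // copies only b

        System.out.println(a.getProperty("hive.exec.parallel")); // true
        System.out.println(b.getProperty("hive.exec.parallel")); // false
    }
}
```

The point is that the two task-local tables share one set of internal objects until one of them is written to, which is exactly the situation with per-task {{JobConf}} clones that are rarely, if ever, modified.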



> Intern JobConf objects in Spark tasks
> -------------------------------------
>
>                 Key: HIVE-19937
>                 URL: https://issues.apache.org/jira/browse/HIVE-19937
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-19937.1.patch
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the 
> {{JobConf}} object to prevent any {{ConcurrentModificationException}} from 
> being thrown. However, setting this variable comes at a cost of storing a 
> duplicate {{JobConf}} object for each Spark task. These objects can take up a 
> significant amount of memory, we should intern them so that Spark tasks 
> running in the same JVM don't store duplicate copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
