[jira] [Updated] (HIVE-16489) HMS wastes 26.4% of memory due to dup strings in metastore.api.Partition.parameters

Misha Dmitriev (JIRA) Thu, 20 Apr 2017 16:45:41 -0700

     [ https://issues.apache.org/jira/browse/HIVE-16489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Misha Dmitriev updated HIVE-16489:
----------------------------------
    Description: 
I've just analyzed an HMS heap dump. It turns out that it contains a lot of 
duplicate strings that waste 26.4% of the heap. Most of them come from 
HashMaps referenced by 
org.apache.hadoop.hive.metastore.api.Partition.parameters. Below is the 
relevant section of the jxray (www.jxray.com) report. Looking at 
Partition.java, I see that somebody has already added code to intern keys and 
values in the parameters table when it's first set up. However, it looks like 
key-value pairs added later are not interned, which probably explains all 
these duplicate strings.

{code}
6. DUPLICATE STRINGS

Total strings: 3,273,557  Unique strings: 460,390  Duplicate values: 110,232  
Overhead: 3,220,458K (26.4%)

Top duplicate strings:
    Ovhd         Num char[]s   Num objs   Value

 46,088K (0.4%)     5871        5871      
"HBa4rRAAGx2MEmludGVyZXN0cmF0ZXNwcmVhZBgM/wD/AP8AXAAAAqEAERYBFQAXAAAAAAAAIEAWuK0QAA1s
 ...[length 4000]"
 46,088K (0.4%)     5871        5871      
"BQcHBQUGBQgGBQcHCAUGCAkECQcFBQwGBgoJBQYHBQUFBQYKBQgIBgUJEgYFDAYJBgcGBAcLBQYGCAgGCQYG
 ...[length 4000]"
...

===================================================

7. REFERENCE CHAINS FOR DUPLICATE STRINGS

  2,326,150K (19.1%), 597058 dup strings (36386 unique), 597058 dup backing 
arrays:
39949 of "-1", 39088 of "true", 28959 of "8", 20987 of "1", 18437 of "10", 9583 
of "9", 5908 of "269664", 5691 of "174528", 4598 of "133980", 4598 of 
"BgUGBQgFCAYFCgYIBgUEBgQHBgUGCwYGBwYHBgkKBwYGBggIBwUHBgYGCgUJCQUG ...[length 
3560]"
... and 419200 more strings, of which 36376 are unique
Also contains one-char strings: 217 of "6", 147 of "7", 91 of "4", 28 of "5", 
28 of "2", 21 of "0"
     <--  {j.u.HashMap}.values <-- 
org.apache.hadoop.hive.metastore.api.Partition.parameters <--  {j.u.ArrayList} 
<-- 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result.success
 <-- Java Local 
(org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result)
 [@6e33618d8,@6eedb9a80,@6eedbad68,@6eedbc788] ... and 3 more GC roots
  463,060K (3.8%), 119644 dup strings (34075 unique), 119644 dup backing arrays:
7914 of "true", 7912 of "-1", 6578 of "8", 5606 of "1", 2302 of "10", 1626 of 
"174528", 1223 of "9", 970 of "171680", 837 of "269664", 657 of "133980"
... and 84009 more strings, of which 34065 are unique
Also contains one-char strings: 42 of "7", 31 of "6", 20 of "4", 8 of "5", 5 of 
"2", 3 of "0"
     <--  {j.u.HashMap}.values <-- 
org.apache.hadoop.hive.metastore.api.Partition.parameters <--  
{j.u.TreeMap}.values <-- Java Local (j.u.TreeMap) [@6f084afa0,@73aac9e68]
  233,384K (1.9%), 64601 dup strings (27295 unique), 64601 dup backing arrays:
4472 of "true", 4173 of "-1", 3798 of "1", 3591 of "8", 813 of "174528", 684 of 
"10", 623 of "CQUJBQcFCAcGBwUFCgUIDAgEBwgFBQcHBwgGBwYEBQoLCggFCAYHBgcIBwkIDgcG 
...[length 4000]", 623 of 
"BQcHBQUGBQgGBQcHCAUGCAkECQcFBQwGBgoJBQYHBQUFBQYKBQgIBgUJEgYFDAYJ ...[length 
4000]", 623 of 
"BgUGBQgFCAYFCgYIBgUEBgQHBgUGCwYGBwYHBgkKBwYGBggIBwUHBgYGCgUJCQUG ...[length 
3560]", 623 of 
"AAMAAAEAAAAAAAEAAAAAAQABAAEHAwAKAgAEAwAAAAAAAgAEAAAAAAMAAAADAAAA ...[length 
4000]"
... and 44568 more strings, of which 27285 are unique
Also contains one-char strings: 305 of "7", 301 of "0", 277 of "4", 146 of "6", 
29 of "2", 23 of "5", 19 of "9", 2 of "3"
     <--  {j.u.HashMap}.values <-- 
org.apache.hadoop.hive.metastore.api.Partition.parameters <--  {j.u.ArrayList} 
<-- Java Local (j.u.ArrayList) [@4f4cfbd10,@536122408,@726616778]
...
{code}
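
The gap described above (strings interned when the parameters table is first 
set up, but not when pairs are added later) could be closed along the lines 
of the sketch below. This is only an illustration, not the actual Hive code: 
the class ParameterInterner and its methods putInterned and internAll are 
hypothetical names, and the only library call it relies on is String.intern().

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Hive: interns strings stored in a parameters map. */
final class ParameterInterner {
    private ParameterInterner() { }

    // Null-safe wrapper around String.intern().
    private static String intern(String s) {
        return s == null ? null : s.intern();
    }

    /** Adds one pair, interning both key and value so equal strings share one object. */
    static String putInterned(Map<String, String> params, String key, String value) {
        return params.put(intern(key), intern(value));
    }

    /** Builds a new map whose keys and values are all interned (the "first set up" case). */
    static Map<String, String> internAll(Map<String, String> source) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            putInterned(result, e.getKey(), e.getValue());
        }
        return result;
    }
}
```

With this approach, two partitions that both receive a value such as "true" 
after their maps were created end up referencing the same String object. 
Callers would have to route every later insertion through putInterned; a 
plain Map.put on the same map would still bypass interning.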

  was:
I've created a Hive table with 2000 partitions, each backed by two files, with 
one row in each file. When I execute some number of concurrent queries against 
this table, e.g. as follows

{code}
for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:10000 -n admin -p 
admin -e "select count(i_f_1) from misha_table;" & done
{code}

it results in a big memory spike. With 20 queries I caused an OOM in an HS2 
server running with -Xmx200m, and with 50 queries in one running with -Xmx500m.

I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
that was generated in the 50 queries/500m heap scenario. It suggests that there 
are several opportunities to reduce memory pressure with fairly non-invasive 
changes to the code. One (duplicate strings) has been addressed in 
https://issues.apache.org/jira/browse/HIVE-15882. In this ticket, I am going to 
address the fact that almost 20% of memory is used by instances of 
java.util.Properties. These objects are highly duplicated, since for each 
partition each concurrently running query creates its own copy of Partition, 
PartitionDesc and Properties. Thus we have nearly 100,000 (50 queries * 2,000 
partitions) Properties in memory. By interning/deduplicating these objects we 
may be able to save perhaps 15% of memory.

Note, however, that if there are queries that mutate partitions, the 
corresponding Properties would be mutated as well. Thus we cannot simply use a 
single "canonicalized" Properties object at all times for all Partition objects 
representing the same DB partition. Instead, I am going to introduce a special 
CopyOnFirstWriteProperties class. Such an object initially internally 
references a canonicalized Properties object, and keeps doing so while only 
read methods are called. However, once any mutating method is called, the given 
CopyOnFirstWriteProperties copies the data into its own table from the 
canonicalized table, and uses it ever after.
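
A minimal sketch of the copy-on-first-write idea described above, for 
illustration only. It is not the class proposed in this ticket: the name 
CopyOnFirstWritePropertiesSketch and the isCopied() method are hypothetical, 
it wraps java.util.Properties instead of extending it, it exposes only three 
operations, and it is not thread-safe.

```java
import java.util.Properties;

/** Hypothetical wrapper: shares a canonical Properties until the first mutation. */
final class CopyOnFirstWritePropertiesSketch {
    private Properties shared; // canonical table, never mutated through this object
    private Properties own;    // private copy, created lazily on the first write

    CopyOnFirstWritePropertiesSketch(Properties canonical) {
        this.shared = canonical;
    }

    /** Read path: served from the shared table until a private copy exists. */
    String getProperty(String key) {
        return current().getProperty(key);
    }

    /** Write path: forces the private copy, then mutates only that copy. */
    Object setProperty(String key, String value) {
        return writable().setProperty(key, value);
    }

    Object remove(String key) {
        return writable().remove(key);
    }

    /** True once this object has stopped sharing the canonical table. */
    boolean isCopied() {
        return own != null;
    }

    private Properties current() {
        return own != null ? own : shared;
    }

    private Properties writable() {
        if (own == null) {
            own = new Properties();
            own.putAll(shared); // copy the entries once
            shared = null;      // drop the reference to the canonical table
        }
        return own;
    }
}
```

Many such objects can be constructed around one canonical Properties 
instance. As long as only getProperty is called they add almost no memory, 
and a write through one wrapper is not visible through the others or in the 
canonical table.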


> HMS wastes 26.4% of memory due to dup strings in 
> metastore.api.Partition.parameters
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-16489
>                 URL: https://issues.apache.org/jira/browse/HIVE-16489
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
