[ 
https://issues.apache.org/jira/browse/SENTRY-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misha Dmitriev updated SENTRY-1811:
-----------------------------------
    Description: 
We obtained a heap dump from the JVM running Hive Metastore, taken while a 
Sentry HDFS sync operation was in progress. I analyzed this dump with jxray 
(www.jxray.com) and found that more than 19% of memory is wasted on empty or 
suboptimally sized Java collections:

{code}
9. BAD COLLECTIONS

Total collections: 54,057,249  Bad collections: 31,569,606  Overhead: 
5,292,821K (19.3%)
{code}

Most of these collections come from Thrift classes used by the Sentry plugin 
(see below). The associated memory waste could be significantly reduced or 
eliminated if these collections were allocated lazily, and with an initial 
capacity smaller than the default of 16 for HashMap/HashSet.
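
To illustrate the idea, here is a minimal sketch of lazy allocation with a small initial capacity. LazyPathNode is a hypothetical stand-in for a node class such as TPathEntry or HMSPaths$Entry; its field types and method names are assumptions made for the example, not the actual Sentry code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical node class; names and types are illustrative only.
final class LazyPathNode {
    // Stays null until the first child is added, so leaf nodes
    // do not pay for an empty HashSet at all.
    private Set<Integer> children;

    void addChild(int childId) {
        if (children == null) {
            // Request a small initial capacity instead of the default 16.
            // The set still grows automatically if more children arrive.
            children = new HashSet<>(1);
        }
        children.add(childId);
    }

    // Read-only view; leaves share one immutable empty set.
    Set<Integer> getChildren() {
        return children == null
            ? Collections.<Integer>emptySet()
            : Collections.unmodifiableSet(children);
    }

    // Exposed only so the sketch can show that nothing was allocated.
    boolean hasAllocatedChildren() {
        return children != null;
    }

    public static void main(String[] args) {
        LazyPathNode leaf = new LazyPathNode();
        System.out.println(leaf.hasAllocatedChildren());

        LazyPathNode dir = new LazyPathNode();
        dir.addChild(42);
        System.out.println(dir.getChildren());
    }
}
```

Callers that only read go through getChildren() and never see null, so the lazy field stays an internal detail of the node class.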

{code}
  1,869,023K (6.8%): j.u.HashSet: 3388670 of 1-elem 979,537K (3.6%), 5897806 of 
empty 552,919K (2.0%), 1010321 of small 336,566K (1.2%)
     <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.children <--  
{j.u.HashMap}.values <-- 
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- 
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java 
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
  1,190,050K (4.3%): j.u.HashMap: 3382765 of 1-elem 898,546K (3.3%), 1005341 of 
small 291,503K (1.1%)
     <-- org.apache.sentry.hdfs.HMSPaths$Entry.children <-- 
org.apache.sentry.hdfs.HMSPaths$Entry.{parent} <--  {j.u.HashSet} <--  
{j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <-- 
org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <-- 
org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030 
(org.apache.sentry.hdfs.MetastorePlugin)
  969,442K (3.5%): j.u.TreeSet: 5907188 of 1-elem 969,148K (3.5%)
     <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.authzObjs <--  
{j.u.HashMap}.values <-- 
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- 
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java 
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
  487,690K (1.8%): j.u.TreeSet: 4801877 of empty 487,690K (1.8%)
     <-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs <-- 
org.apache.sentry.hdfs.HMSPaths$Entry.{parent} <--  {j.u.HashSet} <--  
{j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <-- 
org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <-- 
org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030 
(org.apache.sentry.hdfs.MetastorePlugin)
  415,064K (1.5%): j.u.HashMap: 5897806 of empty 414,689K (1.5%)
     <-- org.apache.sentry.hdfs.HMSPaths$Entry.children <--  {j.u.HashSet} <--  
{j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <-- 
org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <-- 
org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030 
(org.apache.sentry.hdfs.MetastorePlugin)
{code}

Additionally, a significant share of memory (9.4%) is wasted on duplicate 
strings:

{code}
7. DUPLICATE STRINGS

Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 
4,897,743  Overhead: 2,570,746K (9.4%)
{code}

More than a third of that overhead comes from Sentry:

{code}
  917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing 
arrays:
     <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  
{j.u.HashMap}.values <-- 
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- 
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java 
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
{code}

These can be eliminated by inserting String.intern() calls in the appropriate 
places.
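
As a rough sketch of what such a call might look like, the hypothetical class below interns the path element when it is stored. InternedPathEntry and its accessor are illustrative names, not the generated Thrift code; where exactly the call belongs in Sentry is left open here.

```java
// Hypothetical holder for a path element; not the actual TPathEntry class.
final class InternedPathEntry {
    private final String pathElement;

    InternedPathEntry(String pathElement) {
        // intern() returns the canonical instance from the JVM string pool,
        // so entries with equal path elements share a single String object.
        this.pathElement = pathElement == null ? null : pathElement.intern();
    }

    String getPathElement() {
        return pathElement;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal contents.
        InternedPathEntry a = new InternedPathEntry(new String("warehouse"));
        InternedPathEntry b = new InternedPathEntry(new String("warehouse"));
        // Reference comparison: true only if both entries share one object.
        System.out.println(a.getPathElement() == b.getPathElement());
    }
}
```

Each intern() call costs a pool lookup, so it pays off mainly for fields like pathElement where the dump shows many copies of relatively few unique values.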


> Optimize data structures used in HDFS sync
> ------------------------------------------
>
>                 Key: SENTRY-1811
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1811
>             Project: Sentry
>          Issue Type: Improvement
>    Affects Versions: 1.8.0, sentry-ha-redesign
>            Reporter: Misha Dmitriev



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
