-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60843/
-----------------------------------------------------------

Review request for sentry, Alexander Kolbasov and kalyan kumar kalvagadda.


Repository: sentry


Description
-------

We obtained a heap dump from the JVM running Hive Metastore, taken while a 
Sentry HDFS sync operation was in progress. I analyzed the dump with jxray 
(www.jxray.com) and found that a significant percentage of memory is wasted 
on duplicate strings:

{code}
7. DUPLICATE STRINGS

Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 4,897,743  Overhead: 2,570,746K (9.4%)
{code}

More than a third of this overhead comes from Sentry:

{code}
  917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing arrays:
     <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement
     <-- {j.u.HashMap}.values
     <-- org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap
     <-- org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump
     <-- Java Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
{code}

SENTRY-1811 eliminated the duplicate strings in memory. However, when these 
strings are serialized into the TPathsDump thrift message, they are duplicated 
again. That is, if 3 different TPathEntry objects have the same 
pathElement="foo", then even if memory holds only one interned copy of "foo", 
a separate copy of "foo" is written to the serialized message for each of the 
3 TPathEntries. This is one reason why serialized TPathsDump messages can get 
very big, consume a lot of memory, and take a long time to send over the 
network.

To address this problem, we could use a form of custom compression: instead of 
writing multiple copies of a duplicate string, we substitute each occurrence 
with a shorter "string id".
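
The string-id idea can be illustrated with a minimal, self-contained sketch. 
It is not the code in this patch: the class StringTableSketch, the Encoded 
holder and the encode/decode methods are hypothetical names chosen for 
illustration, and the sketch works on a plain list of path elements rather 
than on the real TPathsDump/TPathEntry thrift structures. It only shows the 
general technique of writing each distinct string once and referring to it by 
an integer id everywhere else.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: replace repeated path elements with integer ids. */
public class StringTableSketch {

    /** Encoded form: the unique strings plus one id per original element. */
    static final class Encoded {
        final List<String> table = new ArrayList<>(); // id -> string, each value stored once
        final List<Integer> ids = new ArrayList<>();  // one id per original element, in order
    }

    /** Builds the string table and the id sequence for the given elements. */
    static Encoded encode(List<String> pathElements) {
        Encoded out = new Encoded();
        Map<String, Integer> idByString = new HashMap<>();
        for (String s : pathElements) {
            Integer id = idByString.get(s);
            if (id == null) {
                // First occurrence: assign the next free id and store the string once.
                id = out.table.size();
                idByString.put(s, id);
                out.table.add(s);
            }
            out.ids.add(id);
        }
        return out;
    }

    /** Restores the original element sequence by looking each id up in the table. */
    static List<String> decode(Encoded in) {
        List<String> result = new ArrayList<>(in.ids.size());
        for (int id : in.ids) {
            result.add(in.table.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("foo", "bar", "foo", "foo");
        Encoded e = encode(elements);
        System.out.println(e.table);                    // [foo, bar]
        System.out.println(e.ids);                      // [0, 1, 0, 0]
        System.out.println(decode(e).equals(elements)); // true
    }
}
```

In this sketch "foo" is stored once in the table even though it occurs three 
times in the input. Whether the saving is worthwhile depends on how long the 
strings are compared with the encoded size of an id, and on how often they 
repeat.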


Diffs
-----

  sentry-hdfs/sentry-hdfs-common/src/gen/thrift/gen-javabean/org/apache/sentry/hdfs/service/thrift/TPathsDump.java 722ad76d9
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/AuthzPathsDumper.java 095095710
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/HMSPathsDumper.java 479188e51
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/Updateable.java e777e4b1a
  sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPaths.java 08a3b3e92
  sentry-hdfs/sentry-hdfs-common/src/main/resources/sentry_hdfs_service.thrift b0a1f877b
  sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestHMSPathsFullDump.java 194ffb755
  sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestUpdateableAuthzPaths.java 9a726da27
  sentry-hdfs/sentry-hdfs-namenode-plugin/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPermissions.java 89a3297d4
  sentry-hdfs/sentry-hdfs-service/src/main/java/org/apache/sentry/hdfs/PathImageRetriever.java 2426b4079


Diff: https://reviews.apache.org/r/60843/diff/1/


Testing
-------


Thanks,

Arjun Mishra
