----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/60843/#review180537 -----------------------------------------------------------
Ship it! Looks good to me. - Brian Towles On July 13, 2017, 1:03 p.m., Arjun Mishra wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/60843/ > ----------------------------------------------------------- > > (Updated July 13, 2017, 1:03 p.m.) > > > Review request for sentry, Alexander Kolbasov and kalyan kumar kalvagadda. > > > Repository: sentry > > > Description > ------- > > We obtained a heap dump taken from the JVM running Hive Metastore at the time > when Sentry HDFS sync operation was performed. I've analyzed this dump with > jxray (www.jxray.com) and found that a significant percentage of memory is > wasted due to duplicate strings: > > {code} > 7. DUPLICATE STRINGS > > Total strings: 29,986,017 Unique strings: 9,640,413 Duplicate values: > 4,897,743 Overhead: 2,570,746K (9.4%) > {code} > > Of them, more than 1/3 come from sentry: > > {code} > 917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing > arrays: > <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <-- > {j.u.HashMap}.values <-- > org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- > org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java > Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate) > {code} > > The duplicate strings in memory have been eliminated by SENTRY-1811. However, > when these strings are serialized into the TPathsDump thrift message, they > are duplicated again. That is, if there are 3 different TPathEntry objects > with the same pathElement="foo", then (even if there is only one interned > copy of the "foo" string in memory), a separate copy of "foo" will be written > to the serialized message for each of these 3 TPathEntries. This is one > reason why TPathsDump serialized messages may get very big, consume a lot of > memory and take long time to send over the network. > > To address this problem we may use some form of custom compression, where we > don't write multiple copies of duplicate strings, but rather substitute them > with some shorter "string ids". > > > Diffs > ----- > > > sentry-hdfs/sentry-hdfs-common/src/gen/thrift/gen-javabean/org/apache/sentry/hdfs/service/thrift/TPathsDump.java > 722ad76d9 > > sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/AuthzPathsDumper.java > 095095710 > > sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/HMSPathsDumper.java > 479188e51 > > sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/Updateable.java > e777e4b1a > > sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPaths.java > 08a3b3e92 > > sentry-hdfs/sentry-hdfs-common/src/main/resources/sentry_hdfs_service.thrift > b0a1f877b > > sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestHMSPathsFullDump.java > 194ffb755 > > sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestUpdateableAuthzPaths.java > 9a726da27 > > sentry-hdfs/sentry-hdfs-namenode-plugin/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPermissions.java > 89a3297d4 > > sentry-hdfs/sentry-hdfs-service/src/main/java/org/apache/sentry/hdfs/PathImageRetriever.java > 2426b4079 > > > Diff: https://reviews.apache.org/r/60843/diff/1/ > > > Testing > ------- > > > Thanks, > > Arjun Mishra > >
