[jira] [Updated] (SPARK-35610) Memory leak in Spark interpreter
[ https://issues.apache.org/jira/browse/SPARK-35610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-35610:
----------------------------------
Component/s: (was: Tests)

Key: SPARK-35610
URL: https://issues.apache.org/jira/browse/SPARK-35610
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1, 3.1.2, 3.2.0
Reporter: Attila Zsolt Piros
Assignee: Attila Zsolt Piros
Priority: Major

I have identified this leak by running the Livy tests (I know Livy is close to the attic, but this leak causes a constant OOM there), and it is present in our Spark unit tests as well.

The leak can be identified by checking the number of LeakyEntry instances in the case of Scala 2.12.14 (ZipEntry for Scala 2.12.10), which, together with their related data, can take up a considerable amount of memory, as they are created from the jars on the classpath.

I have my own tool for instrumenting JVM code ([trace-agent|https://github.com/attilapiros/trace-agent]), and with it I am able to run JVM diagnostic commands at specific methods. Let me show it in action.

The tool is configured by a single text file embedded into its jar, called actions.txt. In this case its content is:

{noformat}
$ unzip -q -c trace-agent-0.0.7.jar actions.txt
diagnostic_command org.apache.spark.repl.ReplSuite runInterpreter cmd:gcClassHistogram,limit_output_lines:8,where:beforeAndAfter,with_gc:true
diagnostic_command org.apache.spark.repl.ReplSuite afterAll cmd:gcClassHistogram,limit_output_lines:8,where:after,with_gc:true
{noformat}

This creates a class histogram at the beginning and at the end of org.apache.spark.repl.ReplSuite#runInterpreter() (after triggering a GC, which might not finish, as GC runs in a separate thread) and one more histogram at the end of the afterAll() method.
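For reference, the gcClassHistogram diagnostic command used above is the same one jcmd exposes as GC.class_histogram, and it can also be invoked from inside the JVM through the com.sun.management DiagnosticCommand MBean. The following is only a rough, hand-written Scala sketch of what the agent's diagnostic_command action amounts to; it is not part of trace-agent and assumes a HotSpot JVM:

{noformat}
import java.lang.management.ManagementFactory
import javax.management.ObjectName

object ClassHistogram {
  // Invokes the HotSpot "gcClassHistogram" diagnostic command (the same data
  // `jmap -histo` / `jcmd <pid> GC.class_histogram` would print) and keeps
  // only the rows whose class name matches the given pattern.
  def dump(pattern: String, maxLines: Int = 8): Unit = {
    val server = ManagementFactory.getPlatformMBeanServer
    val name   = new ObjectName("com.sun.management:type=DiagnosticCommand")
    val histogram = server.invoke(
      name,
      "gcClassHistogram",
      Array[AnyRef](Array.empty[String]),   // no extra command arguments
      Array("[Ljava.lang.String;")
    ).asInstanceOf[String]
    histogram.linesIterator.filter(_.contains(pattern)).take(maxLines).foreach(println)
  }
}

// Calling ClassHistogram.dump("LeakyEntry") before and after
// ReplSuite#runInterpreter shows the growing instance counts below.
{noformat}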
The histograms on the master branch are the following:

{noformat}
$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" | grep "ZipEntry\|LeakyEntry"
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1576712       75682176  scala.reflect.io.FileZipArchive$LeakyEntry
{noformat}

The header of the table is:

{noformat}
 num     #instances         #bytes  class name
{noformat}

So the LeakyEntry instances at the end amount to about 75MB (173MB in the case of Scala 2.12.10 and before, for another class, ZipEntry), but the first item (char/byte arrays) and the second item (strings) in the histogram also relate to this leak:

{noformat}
$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" | grep "1:\|2:\|3:"
   1:          2701        3496112  [B
   2:         21855        2607192  [C
   3:          4885         537264  java.lang.Class
   1:        480323       55970208  [C
   2:        480499       11531976  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        481825       56148024  [C
   2:        481998       11567952  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        487056       57550344  [C
   2:        487179       11692296  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        487054       57551008  [C
   2:        487176       11692224  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        927823      107139160  [C
   2:        928072       22273728  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        927793      107129328  [C
   2:        928041       22272984  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
{noformat}
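A quick cross-check, derived purely from the histogram rows above:

{noformat}
75682176 bytes / 1576712 instances = 48 bytes per LeakyEntry
1576712 instances / 197089 per run  = 8 runs' worth of retained entries
{noformat}

In other words, every runInterpreter() call leaks another full set of roughly 197k entries (presumably one per file in the classpath jars), so the retained memory grows linearly with the number of tests executed.
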
[jira] [Updated] (SPARK-35610) Memory leak in Spark interpreter
[ https://issues.apache.org/jira/browse/SPARK-35610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-35610:
---------------------------------------
Description:

I have identified this leak by running the Livy tests (I know Livy is close to the attic, but this leak causes a constant OOM there), and it is present in our Spark unit tests as well.

The leak can be identified by checking the number of ZipEntry instances, which can take up a considerable amount of memory, as they are created from the jars on the classpath.

I have my own tool to instrument JVM code ([trace-agent|https://github.com/attilapiros/trace-agent]), and with it I am able to run JVM diagnostic commands at specific methods. It is configured by a single text file embedded into its jar, called actions.txt. In this case its content is:

{noformat}
$ unzip -q -c trace-agent-0.0.7.jar actions.txt
diagnostic_command org.apache.spark.repl.ReplSuite runInterpreter cmd:gcClassHistogram,limit_output_lines:8,where:beforeAndAfter,with_gc:true
{noformat}

This creates a class histogram at the beginning and at the end of org.apache.spark.repl.ReplSuite#runInterpreter() (after triggering a GC, which might not finish, as GC runs in a separate thread).

The histograms on the master branch are the following:

{noformat}
$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" | grep "ZipEntry"
   2:        196797       15743760  java.util.zip.ZipEntry
   2:        196797       15743760  java.util.zip.ZipEntry
   2:        393594       31487520  java.util.zip.ZipEntry
   2:        393594       31487520  java.util.zip.ZipEntry
   2:        590391       47231280  java.util.zip.ZipEntry
   2:        590391       47231280  java.util.zip.ZipEntry
   2:        787188       62975040  java.util.zip.ZipEntry
   2:        787188       62975040  java.util.zip.ZipEntry
   2:        983985       78718800  java.util.zip.ZipEntry
   2:        983985       78718800  java.util.zip.ZipEntry
   2:       1180782       94462560  java.util.zip.ZipEntry
   2:       1180782       94462560  java.util.zip.ZipEntry
   2:       1377579      110206320  java.util.zip.ZipEntry
   2:       1377579      110206320  java.util.zip.ZipEntry
   2:       1574376      125950080  java.util.zip.ZipEntry
   2:       1574376      125950080  java.util.zip.ZipEntry
   2:       1771173      141693840  java.util.zip.ZipEntry
   2:       1771173      141693840  java.util.zip.ZipEntry
   2:       1967970      157437600  java.util.zip.ZipEntry
Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   2:       1967970      157437600  java.util.zip.ZipEntry
   2:       2164767      173181360  java.util.zip.ZipEntry
{noformat}

The header of the table is:

{noformat}
 num     #instances         #bytes  class name
{noformat}

So the ZipEntry instances altogether take about 173MB, but the first item in the histogram, the char/byte arrays, also relates to this leak:

{noformat}
$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" | grep "1:"
   1:          2619        3185752  [B
   1:        480784       55931000  [C
   1:        480969       55954072  [C
   1:        912647      104092392  [C
   1:        912552      104059536  [C
   1:       1354362      153683280  [C
   1:       1354332      153673448  [C
   1:       1789703      202088704  [C
   1:       1789676      202079056  [C
   1:       2232868      251789104  [C
   1:       2232248      251593392  [C
   1:       2667318      300297664  [C
   1:       2667203      300256912  [C
   1:       3100253      348498384  [C
   1:       3100250      348498896  [C
   1:       3533763      396801848  [C
   1:       3533725      396789720  [C
   1:       3967515      445141784  [C
   1:       3967459      445128328  [C
   1:       4401309      493509768  [C
Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   1:       4401236      493496752  [C
   1:       4836168      541965464  [C
{noformat}

This is 541MB.
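For this older ZipEntry-based measurement the same arithmetic applies, again derived only from the rows above:

{noformat}
173181360 bytes / 2164767 instances = 80 bytes per java.util.zip.ZipEntry
2164767 instances / 196797 per run  = 11 runs' worth of retained entries
541965464 bytes of [C (char arrays) at the end ~ 541MB
{noformat}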