[ https://issues.apache.org/jira/browse/SPARK-29321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim updated SPARK-29321:
---------------------------------
    Attachment: Screen Shot 2019-10-20 at 10.55.03 PM.png

Possible memory leak in Spark
-----------------------------

                 Key: SPARK-29321
                 URL: https://issues.apache.org/jira/browse/SPARK-29321
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.3
            Reporter: George Papa
            Priority: Major
         Attachments: Screen Shot 2019-10-20 at 10.55.03 PM.png

This issue is a clone of SPARK-29055. After Spark version 2.3.3, I observe that the JVM memory increases slightly over time. This behavior also affects application performance: when I run my real application in a testing environment, after a while the persisted dataframes no longer fit into the executors' memory and I get spills to disk.

JVM memory usage (based on the htop command):

||Time||RES||SHR||MEM%||
|1min|1349|32724|1.5|
|3min|1936|32724|2.2|
|5min|2506|32724|2.6|
|7min|2564|32724|2.7|
|9min|2584|32724|2.7|
|11min|2585|32724|2.7|
|13min|2592|32724|2.7|
|15min|2591|32724|2.7|
|17min|2591|32724|2.7|
|30min|2600|32724|2.7|
|1h|2618|32724|2.7|

*HOW TO REPRODUCE THIS BEHAVIOR:*

Reproduce the behavior above by running the snippet below (I prefer to run it without any sleep delay) and tracking the JVM memory with the top or htop command (a monitoring sketch follows at the end of this report).

{code:python}
import time
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

target_dir = "..."

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

# Repeatedly read every CSV file in the target directory and count its rows;
# the JVM memory usage should stay flat, but RES keeps growing.
while True:
    for f in os.listdir(target_dir):
        df = spark.read.load(target_dir + f, format="csv")
        print("Number of records: {0}".format(df.count()))
        time.sleep(15)
{code}

*TESTED CASES WITH THE SAME BEHAVIOUR* (a combined configuration sketch follows at the end of this report):

* Default settings (spark-defaults.conf)
* spark.cleaner.periodicGC.interval set to 1min (or less)
* spark.cleaner.referenceTracking.blocking set to false
* Running the application in cluster mode
* Increasing/decreasing the resources of the executors and driver
* extraJavaOptions on driver and executor: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
* Also tested with Spark 2.4.4 (latest at the time), with the same behavior

*DEPENDENCIES*

* Operating system: Ubuntu 16.04.3 LTS
* Java: jdk1.8.0_131 (also tested with jdk1.8.0_221)
* Python: Python 2.7.12
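For reference, this is the kind of loop that can log RES over time while the repro snippet runs, approximating the htop readings in the table above. It is only a sketch under assumptions not taken from the report: it reads VmRSS from /proc on Linux, looks the JVM pid up with pgrep against a pattern that may not match your deployment, and samples every 2 minutes.

{code:python}
import subprocess
import time

def rss_kib(pid):
    # Linux reports resident set size as "VmRSS: <n> kB" in /proc/<pid>/status.
    with open("/proc/{0}/status".format(pid)) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None

# Hypothetical pid lookup: assumes "DataframeCount" appears on the JVM
# command line; adjust the pgrep pattern for your setup.
pid = int(subprocess.check_output(["pgrep", "-f", "DataframeCount"]).split()[0].decode())

start = time.time()
while True:
    print("{0:.0f}min RES={1} kB".format((time.time() - start) / 60.0, rss_kib(pid)))
    time.sleep(120)
{code}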
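For completeness, a minimal sketch of how the cleaner and GC settings from the tested cases above can be applied to one SparkSession. The config keys (spark.cleaner.periodicGC.interval, spark.cleaner.referenceTracking.blocking, spark.driver.extraJavaOptions, spark.executor.extraJavaOptions) are standard Spark settings; combining them in a single builder is an illustration, not necessarily how each case was originally run.

{code:python}
from pyspark.sql import SparkSession

# GC flags taken from the "TESTED CASES" list above.
gc_opts = ("-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 "
           "-XX:ConcGCThreads=12")

spark = (SparkSession.builder
         .appName("DataframeCount")
         # Ask the ContextCleaner to force a GC every minute.
         .config("spark.cleaner.periodicGC.interval", "1min")
         # Make cleanup of shuffle/broadcast state non-blocking.
         .config("spark.cleaner.referenceTracking.blocking", "false")
         # Driver JVM options only take effect if set before the driver JVM
         # starts (e.g. via spark-submit --conf); shown here for completeness.
         .config("spark.driver.extraJavaOptions", gc_opts)
         .config("spark.executor.extraJavaOptions", gc_opts)
         .getOrCreate())
{code}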