[ 
https://issues.apache.org/jira/browse/SPARK-29321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Papa updated SPARK-29321:
--------------------------------
    Description: 
This issue is a clone of SPARK-29055. Since Spark version 2.3.3, I have observed that the JVM memory increases slightly over time. This behavior also affects application performance, because the persisted dataframes no longer fit in the executors' memory and I have spill to disk.
 

*HOW TO REPRODUCE THIS BEHAVIOR:*

Reproduce the above behavior by running the snippet below (I prefer to run it without any sleep delay) and tracking the JVM memory with the top or htop command:
{code:python}
import os
import time

from pyspark.sql import SparkSession

target_dir = "..."  # directory that contains the CSV files

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

# Repeatedly read each CSV file and count its rows; the JVM memory of the
# driver and executors grows slowly across iterations.
while True:
    for f in os.listdir(target_dir):
        df = spark.read.load(os.path.join(target_dir, f), format="csv")
        print("Number of records: {0}".format(df.count()))
        time.sleep(15)
{code}
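As an alternative to top/htop, the growth can also be tracked through Spark's REST monitoring API. The snippet below is only an illustrative sketch: it assumes the default driver UI port 4040 on localhost and the third-party requests package; the polling interval is arbitrary.
{code:python}
import time

import requests  # assumed to be installed; any HTTP client would do

DRIVER_UI = "http://localhost:4040"  # hypothetical driver host, default UI port

# Poll the executor summaries exposed by the Spark monitoring REST API and
# print the reported storage memory, so the slow growth is visible over time.
while True:
    apps = requests.get(DRIVER_UI + "/api/v1/applications").json()
    app_id = apps[0]["id"]
    executors = requests.get(
        DRIVER_UI + "/api/v1/applications/{0}/executors".format(app_id)).json()
    for e in executors:
        print("{0}: memoryUsed={1} maxMemory={2}".format(
            e["id"], e["memoryUsed"], e["maxMemory"]))
    time.sleep(60)
{code}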
 

 

*TESTED CASES WITH THE SAME BEHAVIOR:*
 * Default settings (spark-defaults.conf)
 * Setting {{spark.cleaner.periodicGC.interval}} to 1min or less (a configuration sketch follows this list)
 * Setting {{spark.cleaner.referenceTracking.blocking}} to false
 * Running the application in cluster mode
 * Increasing/decreasing the resources of the executors and driver
 * Setting extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
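For reference, a minimal sketch of how the cleaner settings and the executor JVM options from the list above can be applied programmatically instead of through spark-defaults.conf. The values are the ones tested above, not recommendations; the driver's extraJavaOptions still have to be passed at submit time, because the driver JVM is already running when the SparkSession is built.
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DataframeCount")
         # Run the ContextCleaner's periodic GC more often than the 30min default.
         .config("spark.cleaner.periodicGC.interval", "1min")
         # Do not let the cleaning thread block on cleanup tasks.
         .config("spark.cleaner.referenceTracking.blocking", "false")
         # Same GC flags as in the tested case above, applied to the executors.
         .config("spark.executor.extraJavaOptions",
                 "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 "
                 "-XX:ConcGCThreads=12")
         .getOrCreate())
{code}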
  

*DEPENDENCIES*
 * Operating system: Ubuntu 16.04.3 LTS
 * Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
 * Python: Python 2.7.12

  was:
This issue is a clone of SPARK-29055. Since Spark version 2.3.3, I have observed that the JVM memory increases slightly over time. This behavior also affects application performance, because the persisted dataframes no longer fit in the executors' memory and I have spill to disk.

In more detail, the driver and the executors show the same used storage memory, and after each iteration the storage memory increases. You can reproduce this behavior by running the snippet below. The example is very simple, without any dataframe persistence, yet the memory consumption is not stable as it was in earlier Spark versions (specifically, up to Spark 2.3.2).

I also tested with the Spark Streaming and Structured Streaming APIs and saw the same behavior. I tested with an existing application that reads from a Kafka source, does some aggregations, persists dataframes and then unpersists them. The persist and unpersist work correctly: I see the dataframes in the Storage tab of the Spark UI, and after the unpersist all dataframes are removed. However, after the unpersist the executors' memory is not zero; it has the same value as the driver's memory. This also affects application performance, because the executors' memory increases as the driver's does, and after a while the persisted dataframes no longer fit in the executors' memory and I have spill to disk.
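
For illustration only, a minimal batch sketch of the persist/unpersist pattern described above; the data and the aggregation are placeholders and not the actual Kafka application:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PersistUnpersistExample").getOrCreate()

# Placeholder dataframe standing in for one batch of Kafka data.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

cached = df.persist()  # the dataframe appears in the Storage tab
agg = cached.groupBy("key").agg(F.sum("value").alias("total"))
agg.show()

# After unpersist the dataframe disappears from the Storage tab, yet the
# executors' used memory reported in the UI does not drop back to zero.
cached.unpersist()
{code}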

Another error I hit after a long run was {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}, but I don't know whether it is related to the above behavior or not.

 

*HOW TO REPRODUCE THIS BEHAVIOR:*

Create a very simple application (streaming count_file.py) in order to reproduce this behavior. The application reads CSV files from a directory and counts the rows.

 
{code:python}
import os
import time

from pyspark.sql import SparkSession

target_dir = "..."  # directory that contains the CSV files

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

# Repeatedly read each CSV file and count its rows; the JVM memory of the
# driver and executors grows slowly across iterations.
while True:
    for f in os.listdir(target_dir):
        df = spark.read.load(os.path.join(target_dir, f), format="csv")
        print("Number of records: {0}".format(df.count()))
        time.sleep(15)
{code}
Submit command:
{code:bash}
spark-submit \
  --master spark://xxx.xxx.xx.xxx \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 3 \
  streaming count_file.py
{code}
 

*TESTED CASES WITH THE SAME BEHAVIOR:*
 * Default settings (spark-defaults.conf)
 * Setting {{spark.cleaner.periodicGC.interval}} to 1min (or less)
 * Setting {{spark.cleaner.referenceTracking.blocking}} to false
 * Running the application in cluster mode
 * Increasing/decreasing the resources of the executors and driver
 * Setting extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
  

*DEPENDENCIES*
 * Operating system: Ubuntu 16.04.3 LTS
 * Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
 * Python: Python 2.7.12

 

*NOTE:* In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely low, and after the ContextCleaner and BlockManager ran, the memory decreased.


> Possible memory leak in Spark
> -----------------------------
>
>                 Key: SPARK-29321
>                 URL: https://issues.apache.org/jira/browse/SPARK-29321
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.3
>            Reporter: George Papa
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
