[ https://issues.apache.org/jira/browse/SPARK-46125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868582#comment-17868582 ]
Igor Berman commented on SPARK-46125:
-------------------------------------

We are experiencing what looks like the same memory leak in a long-running Java SparkContext. What we see in MAT (https://eclipse.dev/mat/) is that cached plans in the shared state are not cleaned up properly. We will try to apply the following workaround: sparkSession.catalog().clearCache(). We do have multiple registrations of DataFrames using the JDBC source. We recently upgraded from 3.1.3 to 3.5.1, and the leak was introduced somewhere in between, since rolling back to 3.1.3 'resolves' it.

!screenshot-1.png!

> Memory leak when using createDataFrame with persist
> ---------------------------------------------------
>
>                 Key: SPARK-46125
>                 URL: https://issues.apache.org/jira/browse/SPARK-46125
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark
>    Affects Versions: 3.5.0
>            Reporter: Arman Yazdani
>            Priority: Major
>              Labels: PySpark, memory-leak, persist
>         Attachments: CreateDataFrameWithUnpersist.png, CreateDataFrameWithoutUnpersist.png, ReadParquetWithoutUnpersist.png, image-2023-11-28-12-55-58-461.png, screenshot-1.png
>
>
> When I create a DataFrame from a pandas data frame and persist it (DISK_ONLY), some "byte[]" objects (the total size of the imported data frame) remain in the driver's heap memory.
> This is the sample code for reproducing it:
> {code:python}
> import pandas as pd
> import gc
> from pyspark.sql import SparkSession
> from pyspark.storagelevel import StorageLevel
>
> spark = SparkSession.builder \
>     .config("spark.driver.memory", "4g") \
>     .config("spark.executor.memory", "4g") \
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
>
> pdf = pd.read_pickle('tmp/input.pickle')
> df = spark.createDataFrame(pdf)
> df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
> df.count()
>
> del pdf
> del df
> gc.collect()
> spark.sparkContext._jvm.System.gc()
> {code}
> After running this code, I perform a manual GC in VisualVM, but the driver's memory usage stays at about 550 MB (at the start it was about 50 MB).
> !CreateDataFrameWithoutUnpersist.png|width=467,height=349!
> Then I tested with {{df = df.unpersist()}} added after the {{df.count()}} line and everything was OK (memory usage after a manual GC was about 50 MB).
> !CreateDataFrameWithUnpersist.png|width=468,height=300!
> Also, I tried reading from a Parquet file (without adding the unpersist line) with this code:
> {code:python}
> import gc
> from pyspark.sql import SparkSession
> from pyspark.storagelevel import StorageLevel
>
> spark = SparkSession.builder \
>     .config("spark.driver.memory", "4g") \
>     .config("spark.executor.memory", "4g") \
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
>
> df = spark.read.parquet('tmp/input.parquet')
> df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
> df.count()
>
> del df
> gc.collect()
> spark.sparkContext._jvm.System.gc()
> {code}
> Again, everything was fine and memory usage was about 50 MB after a manual GC.
> !ReadParquetWithoutUnpersist.png|width=473,height=302!
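For reference, a minimal sketch combining the two workarounds discussed above: explicitly unpersisting the persisted DataFrame (as in the reporter's test) and clearing the catalog cache, which is the PySpark equivalent of the sparkSession.catalog().clearCache() call mentioned in the comment. The pickle path is taken from the repro code, and whether clearCache() is enough for the JDBC-source scenario is an assumption that still needs to be verified:

{code:python}
import gc

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Same input as the repro above (assumed path).
pdf = pd.read_pickle('tmp/input.pickle')
df = spark.createDataFrame(pdf).persist(StorageLevel.DISK_ONLY)
df.count()

# Workaround 1 (reporter): explicitly unpersist before dropping the reference;
# in the reporter's test this brought driver heap usage back to the baseline.
df.unpersist(blocking=True)

# Workaround 2 (commenter): drop all cached plans/tables from the shared state.
spark.catalog.clearCache()

del pdf, df
gc.collect()
spark.sparkContext._jvm.System.gc()
{code}

Using unpersist(blocking=True) makes the block removal synchronous, so a heap dump taken afterwards in MAT or VisualVM should already reflect the freed byte[] buffers.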