[ https://issues.apache.org/jira/browse/SPARK-46125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arman Yazdani updated SPARK-46125:
----------------------------------
    Attachment: CreateDataFrameWithUnpersist.png
                ReadParquetWithoutUnpersist.png
                CreateDataFrameWithoutUnpersist.png

> Memory leak when using createDataFrame with persist
> ---------------------------------------------------
>
>                 Key: SPARK-46125
>                 URL: https://issues.apache.org/jira/browse/SPARK-46125
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark
>    Affects Versions: 3.5.0
>            Reporter: Arman Yazdani
>            Priority: Major
>              Labels: PySpark, memory-leak, persist
>         Attachments: CreateDataFrameWithUnpersist.png, 
> CreateDataFrameWithoutUnpersist.png, ReadParquetWithoutUnpersist.png
>
>
> When I create a DataFrame from a pandas DataFrame and persist it (DISK_ONLY), 
> some "byte[]" objects (matching the total size of the imported DataFrame) 
> remain in the driver's heap memory.
> This is sample code for reproducing it:
> {code:python}
> import pandas as pd
> import gc
> from pyspark.sql import SparkSession
> from pyspark.storagelevel import StorageLevel
>
> spark = SparkSession.builder \
>     .config("spark.driver.memory", "4g") \
>     .config("spark.executor.memory", "4g") \
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
>
> # Load the pandas DataFrame and convert it to a Spark DataFrame
> pdf = pd.read_pickle('tmp/input.pickle')
> df = spark.createDataFrame(pdf)
>
> # Persist to disk only and materialize the DataFrame
> df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
> df.count()
>
> # Drop the Python-side references and force GC on both the Python and JVM side
> del pdf
> del df
> gc.collect()
> spark.sparkContext._jvm.System.gc(){code}
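> The input pickle is not attached; a synthetic pandas DataFrame of a few 
> hundred MB (a hypothetical stand-in, not the original data) should work in 
> place of the {{pd.read_pickle}} line, e.g.:
> {code:python}
> import numpy as np
> import pandas as pd
>
> # Roughly 500 MB of float64 data (hypothetical stand-in for the original pickle)
> pdf = pd.DataFrame(
>     np.random.rand(8_000_000, 8),
>     columns=[f"c{i}" for i in range(8)]
> )
> {code}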
> After running this code, I perform a manual GC in VisualVM, but the driver 
> memory usage stays at about 550 MB (at the start it was about 50 MB).
> I then tested adding {{df = df.unpersist()}} after the {{df.count()}} line, 
> and everything was fine (memory usage after performing a manual GC was about 
> 50 MB), as in the sketch below.
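> A minimal sketch of that working variant (the same snippet as above, with the 
> extra unpersist line) is:
> {code:python}
> df = spark.createDataFrame(pdf)
> df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
> df.count()
>
> # Explicitly releasing the persisted blocks avoids the leftover byte[] objects
> df = df.unpersist()
>
> del pdf
> del df
> gc.collect()
> spark.sparkContext._jvm.System.gc()
> {code}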
> I also tried reading from a Parquet file (without adding the unpersist line), 
> using this code:
> {code:python}
> import gc
> from pyspark.sql import SparkSession
> from pyspark.storagelevel import StorageLevel
>
> spark = SparkSession.builder \
>     .config("spark.driver.memory", "4g") \
>     .config("spark.executor.memory", "4g") \
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
>
> # Read the same data from Parquet instead of going through createDataFrame
> df = spark.read.parquet('tmp/input.parquet')
> df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
> df.count()
>
> # Drop the reference and force GC on both sides (no unpersist here either)
> del df
> gc.collect()
> spark.sparkContext._jvm.System.gc(){code}
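> The Parquet input can be produced beforehand from the same pandas DataFrame; 
> as an assumption on how it was created, something like the following, run in 
> a separate session so the read-only test above starts from a clean driver:
> {code:python}
> # Hypothetical step that writes tmp/input.parquet from the pandas DataFrame
> spark.createDataFrame(pdf).write.mode("overwrite").parquet('tmp/input.parquet')
> {code}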
> Again everything was fine, and memory usage was about 50 MB after performing 
> a manual GC.
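> For reproducing this without VisualVM, the driver heap usage can also be read 
> from PySpark through py4j; this is a sketch that relies only on the standard 
> {{java.lang.Runtime}} API, not on anything Spark-specific:
> {code:python}
> # Read used heap of the driver JVM via py4j (alternative to watching VisualVM)
> rt = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
> used_mb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
> print(f"driver heap used: {used_mb:.0f} MB")
> {code}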


