[ 
https://issues.apache.org/jira/browse/SPARK-52516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-52516:
-----------------------------------

    Assignee: L. C. Hsieh

> Memory leak with coalesce, foreachPartition, and v2 datasources
> ----------------------------------------------------------------
>
>                 Key: SPARK-52516
>                 URL: https://issues.apache.org/jira/browse/SPARK-52516
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.3
>            Reporter: Joshua Kolash
>            Assignee: L. C. Hsieh
>            Priority: Major
>              Labels: pull-request-available
>
> Running the following should not leak any significant amount of memory, but it does:
> {code:java}
> sparkSession.sql("select * from icebergcatalog.db.table")
>     .coalesce(4)
>     .foreachPartition((iterator) -> {
>         while (iterator.hasNext()) iterator.next();
>     });
> {code}
> More details are in this Iceberg thread:
> [https://github.com/apache/iceberg/issues/13297]
> In summary, there is a bug where registering a callback that captures a heavy 
> reference via
> {code:java}
> context.addTaskCompletionListener{code}
> can lead to an OOM, because the callback prevents garbage collection of those 
> heavy references. In particular, a coalesce piles up "sub-tasks" such that 
> they cannot be cleaned up until the coalesce task completes.
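> A minimal sketch of the leak pattern (readerFactory and inputPartition are 
> illustrative stand-ins, not the exact Spark source):
> {code:java}
> import java.io.IOException;
> import org.apache.spark.TaskContext;
> import org.apache.spark.sql.catalyst.InternalRow;
> import org.apache.spark.sql.connector.read.PartitionReader;
> import org.apache.spark.util.TaskCompletionListener;
> 
> // Each coalesced "sub-task" sets up a reader and registers a listener
> // to close it when the task completes:
> PartitionReader<InternalRow> reader = readerFactory.createReader(inputPartition);
> TaskContext.get().addTaskCompletionListener((TaskCompletionListener) ctx -> {
>     try {
>         reader.close();  // the lambda captures `reader`...
>     } catch (IOException ignored) {
>     }
> });
> // ...so the reader and its buffers stay strongly reachable until the
> // whole coalesce task finishes, even after this sub-partition's rows
> // have all been consumed.
> {code}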
> The same issue manifests in two different Scala classes:
> {code:java}
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
> {code}
> Iceberg is affected by the first; the v2 Parquet readers are affected by the 
> second.
> The proposed solution is to use a delegate class that de-references the heavy 
> objects on iterator exhaustion or close. This only requires changes local to 
> those classes, with no public API changes.
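> A minimal sketch of the delegate idea (the class name and structure below are 
> illustrative assumptions, not the actual patch):
> {code:java}
> import java.io.Closeable;
> import java.io.IOException;
> import org.apache.spark.sql.catalyst.InternalRow;
> import org.apache.spark.sql.connector.read.PartitionReader;
> 
> // Hypothetical wrapper: the task-completion listener holds only this
> // lightweight delegate, never the heavy reader directly.
> class ReaderDelegate implements Closeable {
>     private PartitionReader<InternalRow> delegate;  // the heavy object
> 
>     ReaderDelegate(PartitionReader<InternalRow> delegate) {
>         this.delegate = delegate;
>     }
> 
>     boolean next() throws IOException {
>         if (delegate == null) {
>             return false;
>         }
>         if (delegate.next()) {
>             return true;
>         }
>         close();  // exhausted: drop the heavy reference eagerly
>         return false;
>     }
> 
>     InternalRow get() {
>         return delegate.get();
>     }
> 
>     @Override
>     public void close() throws IOException {
>         if (delegate != null) {
>             try {
>                 delegate.close();
>             } finally {
>                 delegate = null;  // the heavy reader is now collectable
>             }
>         }
>     }
> }
> {code}
> With this, a listener that outlives the sub-partition pins only the small 
> wrapper, so the reader's buffers can be collected as soon as its rows are 
> exhausted or it is closed.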
> The proposed changes were tested on Spark 3.4.x but not on 4.0.0, though I 
> believe 4.0.0 is likely impacted as well.



