[
https://issues.apache.org/jira/browse/SPARK-52516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joshua Kolash updated SPARK-52516:
----------------------------------
Description:
Doing the following should not leak any significant amount of memory, but it does:
{code:java}
sparkSession.sql("select * from
icebergcatalog.db.table").coalesce(4).foreachPartition(
(iterator) -> { while (iterator.hasNext()) iterator.next(); }
); {code}
Some of the details are discussed in this thread:
[https://github.com/apache/iceberg/issues/13297]
In summary, there is a bug where a callback registered via
{code:java}
context.addTaskCompletionListener{code}
that captures a heavy reference can lead to an OOM, because the callback keeps those
heavy references reachable and prevents their garbage collection. In particular, a
coalesce piles up "sub-tasks" whose resources cannot be cleaned up until the enclosing
coalesce task completes.
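For illustration, here is a minimal sketch of the leak pattern, with a hypothetical
HeavyReader standing in for the real per-partition reader (this is not the actual Spark
source):
{code:scala}
import org.apache.spark.TaskContext

object LeakPatternSketch {
  // Hypothetical heavy, closeable per-partition reader used for illustration.
  final class HeavyReader extends AutoCloseable {
    private val buffer = new Array[Byte](64 * 1024 * 1024) // large per-reader state
    override def close(): Unit = ()
  }

  def openPartition(context: TaskContext): Unit = {
    val reader = new HeavyReader
    // The closure captures `reader`, so the TaskContext keeps it strongly
    // reachable until the whole task completes. Under coalesce(4), a single
    // task drains many input partitions, so these captured readers pile up.
    context.addTaskCompletionListener[Unit] { _ =>
      reader.close()
    }
  }
}
{code}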
The same issue manifests in two different Scala classes:
{code:java}
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
{code}
Iceberg is affected by the first, while jobs using the built-in v2 Parquet readers are
affected by the second.
The proposed solution is to use a delegate class that de-references the heavy objects on
iterator exhaustion or close. This only requires changes local to those classes, with no
public API changes.
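A rough sketch of that delegate idea, assuming a simplified reader interface (HeavyReader
and ReaderDelegate are illustrative names, not the actual patch):
{code:scala}
// Simplified stand-in for a v2 PartitionReader.
trait HeavyReader[T] extends AutoCloseable {
  def next(): Boolean // advance; returns false once exhausted
  def get(): T        // current value
}

// Forwards to the reader but drops the reference once the data is exhausted
// or close() is called, so a long-lived task-completion listener holding the
// delegate no longer pins the heavy reader.
final class ReaderDelegate[T](private var reader: HeavyReader[T])
    extends Iterator[T] with AutoCloseable {
  private var valuePrepared = false

  override def hasNext: Boolean = {
    if (reader == null) return false
    if (!valuePrepared) {
      valuePrepared = reader.next()
      if (!valuePrepared) close() // exhausted: release the heavy reference now
    }
    valuePrepared
  }

  override def next(): T = {
    if (!hasNext) throw new java.util.NoSuchElementException("End of stream")
    valuePrepared = false
    reader.get()
  }

  override def close(): Unit = {
    if (reader != null) {
      reader.close()
      reader = null // de-reference so the heavy object can be GC'd mid-task
    }
  }
}
{code}
The completion listener then retains only the lightweight delegate, and each
sub-partition's reader becomes collectible as soon as that sub-partition is drained,
rather than when the coalesce task finally completes.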
The proposed changes were tested on Spark 3.4.x but not on 4.0.0, though I believe 4.0.0
is likely impacted as well.
> Memory leak with coalesce, foreachPartition, and v2 data sources
> ----------------------------------------------------------------
>
> Key: SPARK-52516
> URL: https://issues.apache.org/jira/browse/SPARK-52516
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.3
> Reporter: Joshua Kolash
> Priority: Major
>