Joshua Kolash created SPARK-52516:
-------------------------------------

             Summary: Memory leak with coalesce, foreachPartition, and v2 data sources
                 Key: SPARK-52516
                 URL: https://issues.apache.org/jira/browse/SPARK-52516
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.4.3
            Reporter: Joshua Kolash


Doing the following should not leak any significant amount of memory:
{code:java}
// The cast disambiguates between the Java and Scala foreachPartition overloads.
sparkSession.sql("select * from icebergcatalog.db.table")
    .coalesce(4)
    .foreachPartition((ForeachPartitionFunction<Row>) iterator -> {
        while (iterator.hasNext()) iterator.next();
    });
{code}
Further details are in this Iceberg issue:

[https://github.com/apache/iceberg/issues/13297]

In summary, there is a bug where registering a callback that captures a heavy reference via
{code:java}
context.addTaskCompletionListener{code}
can lead to an OOM, because the callback prevents garbage collection of those heavy references until the task completes. In particular, a coalesce piles up "sub-tasks" (one per underlying input partition) whose listeners cannot be cleaned up until the entire coalesced task completes.
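
A minimal sketch of the leak pattern, assuming a hypothetical HeavyReader that stands in for a reader pinning large buffers (this is illustrative, not Spark's actual internals):
{code:java}
import java.util.Collections;
import java.util.Iterator;

import org.apache.spark.TaskContext;
import org.apache.spark.util.TaskCompletionListener;

// HeavyReader is a hypothetical stand-in for a file/columnar reader that
// pins large buffers; it is NOT Spark's actual reader code.
final class HeavyReader {
    private final byte[] buffers = new byte[64 << 20]; // simulated heavy state

    HeavyReader(Object partition) { }

    Iterator<Object> rows() { return Collections.emptyIterator(); }

    void close() { }
}

final class LeakSketch {
    // Called once per input partition. Under coalesce(4), a single task
    // iterates many input partitions, so open() runs many times per task.
    static Iterator<Object> open(TaskContext context, Object partition) {
        HeavyReader reader = new HeavyReader(partition);
        // The listener closure captures 'reader', and TaskContext keeps every
        // registered listener reachable until the entire task completes,
        // long after this particular reader is exhausted. The readers
        // therefore accumulate and cannot be garbage collected.
        context.addTaskCompletionListener((TaskCompletionListener) ctx -> reader.close());
        return reader.rows();
    }
}
{code}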

This same issue manifests in two different Scala classes:
{code:java}
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
{code}
Iceberg is affected by the first; the built-in v2 Parquet readers are affected by the second.

The proposed solution is to use a delegate class that de-references the heavy objects on iterator exhaustion or close. This only requires changes local to those classes, without any public API changes. A sketch of the idea follows.
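
A minimal sketch of that delegate idea, assuming a hypothetical SelfClearingIterator wrapper (illustrative only, not the actual patch): the completion listener would retain only this lightweight wrapper, so each sub-task's reader becomes reclaimable as soon as its iterator is drained.
{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper illustrating the delegate pattern. It drops its
// reference to the heavy delegate as soon as the iterator is exhausted
// or closed, allowing garbage collection before the task completes.
final class SelfClearingIterator<T> implements Iterator<T>, Closeable {
    private Iterator<T> delegate; // heavy; nulled out once finished
    private Closeable resource;   // e.g. the underlying reader

    SelfClearingIterator(Iterator<T> delegate, Closeable resource) {
        this.delegate = delegate;
        this.resource = resource;
    }

    @Override
    public boolean hasNext() {
        if (delegate == null) {
            return false;
        }
        if (delegate.hasNext()) {
            return true;
        }
        close(); // exhausted: release the heavy objects eagerly
        return false;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException("iterator already exhausted");
        }
        return delegate.next();
    }

    @Override
    public void close() {
        try {
            if (resource != null) {
                resource.close();
            }
        } catch (IOException ignored) {
            // best-effort cleanup in this sketch
        } finally {
            delegate = null; // de-reference so the GC can reclaim the reader
            resource = null;
        }
    }
}
{code}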

The proposed changes were tested on Spark 3.4.x but not on 4.0.0; however, I believe 4.0.0 is likely impacted as well.



