Joshua Kolash created SPARK-52516:
-------------------------------------
Summary: Memory leak with coalesce, foreachPartition, and v2 datasources
Key: SPARK-52516
URL: https://issues.apache.org/jira/browse/SPARK-52516
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.3
Reporter: Joshua Kolash
Doing the following should not leak any significant amount of memory:
{code:java}
sparkSession.sql("select * from icebergcatalog.db.table")
    .coalesce(4)
    .foreachPartition((iterator) -> {
        while (iterator.hasNext()) iterator.next();
    });
{code}
More details are in this Iceberg thread:
[https://github.com/apache/iceberg/issues/13297]
In summary, there is a bug where capturing a heavy reference in
{code:java}
context.addTaskCompletionListener{code}
can lead to an OOM, because the registered callback prevents garbage collection
of those heavy references. In particular, a coalesce piles up "sub-tasks" whose
listeners cannot be cleaned up until the enclosing coalesce task completes.
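A minimal sketch of the pattern, assuming hypothetical names (HeavyReader, openPartition) in place of the real per-partition reader classes; only the listener pattern itself is taken from the report:
{code:scala}
import org.apache.spark.TaskContext

// HeavyReader is a stand-in for a columnar/Parquet reader holding large
// decode buffers; it is not a real Spark class.
class HeavyReader extends AutoCloseable {
  val buffers = new Array[Byte](64 * 1024 * 1024) // the "heavy" state
  override def close(): Unit = () // release buffers/native resources
  def rows: Iterator[Long] = Iterator.empty
}

def openPartition(context: TaskContext): Iterator[Long] = {
  val reader = new HeavyReader
  // The closure captures `reader`, so the TaskContext keeps it strongly
  // reachable until the *task* completes, not until this input
  // partition is exhausted.
  context.addTaskCompletionListener[Unit](_ => reader.close())
  reader.rows
}
// Under coalesce(4), one task drains many input partitions in sequence,
// so every HeavyReader registered as above stays live until the last
// partition of the coalesced task finishes.
{code}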
The same issue manifests in two different Scala classes:
{code:java}
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
{code}
Iceberg is affected by the first, while workloads using the v2 Parquet readers
are affected by the second.
The proposed solution is to use a delegate class that de-references the heavy
objects on iterator exhaustion or close. This only requires changes local to
those classes, with no public API changes.
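A rough sketch of that delegate idea (illustrative only; SelfCleaningIterator is a hypothetical name, not the actual patch):
{code:scala}
// The task completion listener captures only this small wrapper. The
// wrapper drops its reference to the heavy reader as soon as the rows
// are exhausted or the iterator is closed, making the reader eligible
// for GC even while the coalesced task is still running.
class SelfCleaningIterator[T](private var underlying: Iterator[T] with AutoCloseable)
  extends Iterator[T] with AutoCloseable {

  override def hasNext: Boolean = {
    if (underlying == null) {
      false
    } else if (underlying.hasNext) {
      true
    } else {
      close() // exhausted: release and de-reference eagerly
      false
    }
  }

  override def next(): T = {
    if (underlying == null) throw new NoSuchElementException("closed")
    underlying.next()
  }

  override def close(): Unit = {
    if (underlying != null) {
      try underlying.close()
      finally underlying = null // de-reference so the GC can reclaim it
    }
  }
}
{code}
With a wrapper like this, the completion listener registered via context.addTaskCompletionListener holds only the lightweight delegate, so under coalesce only empty shells accumulate until the task completes.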
The proposed changes were tested on Spark 3.4.x but not on 4.0.0, though I
believe 4.0.0 is likely impacted as well.