[ https://issues.apache.org/jira/browse/SPARK-52516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
L. C. Hsieh reassigned SPARK-52516:
-----------------------------------

    Assignee: L. C. Hsieh

> Memory Leak with coalesce foreachpartitions and v2 datasources
> --------------------------------------------------------------
>
>                 Key: SPARK-52516
>                 URL: https://issues.apache.org/jira/browse/SPARK-52516
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.3
>            Reporter: Joshua Kolash
>            Assignee: L. C. Hsieh
>            Priority: Major
>              Labels: pull-request-available
>
> Doing the following should not leak any significant amount of memory.
> {code:java}
> sparkSession.sql("select * from icebergcatalog.db.table").coalesce(4).foreachPartition(
>     (iterator) -> { while (iterator.hasNext()) iterator.next(); }
> ); {code}
> Some of the details are contained in this thread:
> [https://github.com/apache/iceberg/issues/13297]
> In summary, there is a bug where capturing a heavy reference in
> {code:java}
> context.addTaskCompletionListener{code}
> can lead to an OOM, because the callback prevents garbage collection of those
> heavy references. In particular, a coalesce piles up "sub-tasks" that cannot
> be cleaned up until the coalesce task completes.
> The same issue manifests in two different Scala classes:
> {code:java}
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
> {code}
> Iceberg is affected by the first, while users of the v2 Parquet readers are
> affected by the second.
> The proposed solution is to use a delegate class that de-references the heavy
> objects on iterator exhaustion or close. This only requires changes local to
> those classes, without any public API changes.
> The proposed changes were tested on Spark 3.4.x but not on 4.0.0, though I
> believe 4.0.0 is likely impacted.
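
The delegate idea described in the issue can be illustrated without Spark. The following is a minimal sketch, not the actual patch: FakeTaskContext, HeavyReader, and ReleasingIterator are hypothetical names standing in for the task context, a reader that owns large buffers, and the delegate class. The completion listener captures only the small delegate, which drops its reference to the reader as soon as the iterator is exhausted or closed, so a coalesced task does not accumulate one live reader per sub-partition.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReleasingReaderSketch {

    // Hypothetical stand-in for a task context: listeners stay referenced until the task ends.
    static final class FakeTaskContext {
        private final List<Runnable> listeners = new ArrayList<>();

        void addTaskCompletionListener(Runnable listener) {
            listeners.add(listener);
        }

        void markTaskCompleted() {
            for (Runnable listener : listeners) {
                listener.run();
            }
            listeners.clear();
        }
    }

    // Hypothetical stand-in for a partition reader that owns a large buffer.
    static final class HeavyReader implements Iterator<Integer>, AutoCloseable {
        private final byte[] buffer;
        private int remaining;

        HeavyReader(int bufferBytes, int rows) {
            this.buffer = new byte[bufferBytes];
            this.remaining = rows;
        }

        @Override public boolean hasNext() { return remaining > 0; }

        @Override public Integer next() {
            if (remaining <= 0) throw new NoSuchElementException();
            remaining--;
            return buffer.length;
        }

        @Override public void close() { remaining = 0; }
    }

    // The delegate: the only object the listener captures. It nulls out the
    // heavy reader on exhaustion or close, making the reader collectable early.
    static final class ReleasingIterator implements Iterator<Integer>, AutoCloseable {
        private HeavyReader reader;

        ReleasingIterator(HeavyReader reader) { this.reader = reader; }

        @Override public boolean hasNext() {
            if (reader == null) return false;
            if (reader.hasNext()) return true;
            close(); // exhausted: release the heavy reference now
            return false;
        }

        @Override public Integer next() {
            if (!hasNext()) throw new NoSuchElementException();
            return reader.next();
        }

        @Override public void close() {
            if (reader != null) {
                reader.close();
                reader = null;
            }
        }

        boolean holdsReader() { return reader != null; }
    }

    // Simulates one coalesced task draining several sub-partitions in sequence and
    // counts how many readers are still referenced before the task completes.
    static int retainedAfterDrain(int subPartitions) {
        FakeTaskContext context = new FakeTaskContext();
        List<ReleasingIterator> opened = new ArrayList<>();
        for (int i = 0; i < subPartitions; i++) {
            ReleasingIterator it = new ReleasingIterator(new HeavyReader(1024, 3));
            context.addTaskCompletionListener(it::close); // captures the delegate only
            opened.add(it);
            while (it.hasNext()) it.next();
        }
        int retained = 0;
        for (ReleasingIterator it : opened) {
            if (it.holdsReader()) retained++;
        }
        context.markTaskCompleted(); // close() is idempotent, so this is safe
        return retained;
    }

    public static void main(String[] args) {
        System.out.println("readers still referenced before task completion: "
                + retainedAfterDrain(4));
    }
}
```

If the listener captured HeavyReader directly instead of the delegate, every reader and its buffer would stay reachable from the context until markTaskCompleted() runs, which is the accumulation pattern the issue describes for coalesce.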