[ https://issues.apache.org/jira/browse/SPARK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-25103.
----------------------------------
    Resolution: Incomplete

> CompletionIterator may delay GC of completed resources
> ------------------------------------------------------
>
>                 Key: SPARK-25103
>                 URL: https://issues.apache.org/jira/browse/SPARK-25103
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.1, 2.1.0, 2.2.0, 2.3.0
>            Reporter: Eyal Farago
>            Priority: Major
>              Labels: bulk-closed
>
> While working on SPARK-22713, I found (and partially fixed) a scenario in which an iterator is already exhausted but still holds a reference to resources that could be GCed at that point. Because of that lingering reference, these resources cannot be GCed.
> The specific fix applied in SPARK-22713 was to wrap the iterator with a CompletionIterator that cleans it up when exhausted. The problem is that it is quite easy to get this wrong by closing over local variables or the _this_ reference in the cleanup function itself.
> I propose solving this by modifying CompletionIterator to discard its references to the wrapped iterator and the cleanup function once the iterator is exhausted.
>
> * A dive into the code showed that most CompletionIterators are eventually consumed by
> {code:java}
> org.apache.spark.scheduler.ShuffleMapTask#runTask{code}
> which does:
> {code:java}
> writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]]){code}
> Looking at the
> {code:java}
> org.apache.spark.shuffle.ShuffleWriter#write{code}
> implementations, it seems all of them first exhaust the iterator and then perform some kind of post-processing: merging spills, sorting, writing partition files and then concatenating them into a single file. Bottom line: the iterator may actually 'sit' for some time after being exhausted.
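The proposed fix above can be illustrated with a minimal sketch. This is not Spark's actual CompletionIterator (which is Scala); it is a hypothetical Java analogue showing the core idea: run the cleanup callback exactly once when the wrapped iterator is exhausted, then null out the references to both the iterator and the callback so that whatever they retain (spill files, buffers, the enclosing _this_) becomes eligible for GC even while the wrapper itself is still reachable, e.g. while a ShuffleWriter is doing post-processing.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of the proposed behavior, not Spark's implementation:
// an iterator wrapper that runs a completion callback once exhausted and
// then discards its references to the delegate and the callback.
final class CompletionIterator<T> implements Iterator<T> {
    private Iterator<T> delegate;   // nulled out after completion
    private Runnable completion;    // nulled out after completion

    CompletionIterator(Iterator<T> delegate, Runnable completion) {
        this.delegate = delegate;
        this.completion = completion;
    }

    @Override
    public boolean hasNext() {
        if (delegate == null) {
            return false;           // already completed and cleaned up
        }
        if (delegate.hasNext()) {
            return true;
        }
        completion.run();           // run cleanup exactly once
        delegate = null;            // drop the reference to the wrapped iterator
        completion = null;          // drop the reference to the cleanup closure
        return false;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return delegate.next();
    }
}
```

Note that the cleanup closure itself can still pin resources if it captures them; discarding the reference to the closure after running it, as above, is what addresses the capture pitfall described in the issue.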
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org