[
https://issues.apache.org/jira/browse/SPARK-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Rosen resolved SPARK-6075.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.4.0
Issue resolved by pull request 4835
[https://github.com/apache/spark/pull/4835]
> After SPARK-3885, some tasks' accumulator updates may be lost
> -------------------------------------------------------------
>
> Key: SPARK-6075
> URL: https://issues.apache.org/jira/browse/SPARK-6075
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Tests
> Affects Versions: 1.4.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Blocker
> Fix For: 1.4.0
>
>
> It looks like some of the AccumulatorSuite tests have started failing
> nondeterministically on Jenkins. The errors seem to be due to lost / missing
> accumulator updates, e.g.
> {code}
> Set(843, 356, 437, [...], 181, 618, 131) did not contain element 901
> {code}
> This could somehow be related to SPARK-3885 /
> https://github.com/apache/spark/pull/4021, a patch to garbage-collect
> accumulators, which was only merged into master.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/lastCompletedBuild/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/org.apache.spark/AccumulatorSuite/add_value_to_collection_accumulators/
> I think I've figured it out: consider the lifecycle of an accumulator in a
> task, say ShuffleMapTask: on the executor, each task deserializes its own
> copy of the RDD inside of its runTask method, so the strong reference to the
> RDD disappears at the end of runTask. In Executor.run(), we call
> Accumulators.values after runTask has exited, so there's a small window in
> which the tasks's RDD can be GC'd, causing accumulators to be GC'd as well
> because there are no longer any strong references to them.
> The fix is to keep strong references in localAccums, since we clear this at
> the end of each task anyways. I'm glad that I was able to figure out
> precisely why this was necessary and sorry that I missed this during review;
> I'll submit a fix shortly. In terms of preventative measures, it might be a
> good idea to write up the lifetime / lifecycle of objects' strong references
> whenever we're using WeakReferences, since the process of explicitly writing
> that out would prevent these sorts of mistakes in the future.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]