[ https://issues.apache.org/jira/browse/SPARK-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell resolved SPARK-3015.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 1931
[https://github.com/apache/spark/pull/1931]

> Removing broadcast in quick successions causes Akka timeout
> -----------------------------------------------------------
>
>                 Key: SPARK-3015
>                 URL: https://issues.apache.org/jira/browse/SPARK-3015
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2
>         Environment: Standalone EC2 Spark shell
>            Reporter: Andrew Or
>            Priority: Blocker
>             Fix For: 1.1.0
>
>
> This issue was originally reported in SPARK-2916 in the context of MLlib, but
> we were able to reproduce it with a simple Spark shell command:
> {code}
> (1 to 10000).foreach { i => sc.parallelize(1 to 1000, 48).sum }
> {code}
> We still do not have a full understanding of the issue, but we have gleaned
> the following information so far. When the driver runs a GC, it attempts to
> clean up all the broadcast blocks that go out of scope at once. This causes
> the driver to send many blocking RemoveBroadcast messages to the
> executors, which in turn send blocking UpdateBlockInfo messages back to
> the driver. Both calls block until they receive the expected responses. We
> suspect that the high frequency at which these blocking messages are sent
> causes either dropped messages or an internal deadlock somewhere.
> Unfortunately, whether it reproduces depends heavily on the environment. We
> have been able to reproduce it on a 6-node cluster in us-west-2, but not in
> us-west-1, for instance.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
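The blocking round-trip pattern the report describes can be sketched in isolation. This is a hedged illustration, not Spark's actual cleaner code: `blockingAsk` is a hypothetical stand-in for the driver-to-executor ask that a RemoveBroadcast message performs, and the point is only that synchronous asks issued in quick succession serialize, so a burst of removals accumulates latency against a fixed per-call timeout.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-in for the driver's ask to an executor: the caller
// blocks until the executor replies or the timeout fires, mirroring the
// blocking RemoveBroadcast / UpdateBlockInfo round trips described above.
def blockingAsk(handler: () => Boolean, timeout: FiniteDuration): Boolean =
  Await.result(Future(handler()), timeout)

// Simulate a GC cleaning many out-of-scope broadcast blocks at once.
// Each removal is a synchronous round trip, so the calls run one after
// another; total wall time grows linearly with the number of blocks.
val removed = (1 to 100).count { id =>
  blockingAsk(() => { /* executor drops block `id` */ true }, 10.seconds)
}
println(removed)  // 100
```

With fast handlers this completes easily, but if each reply is delayed (or a reply is dropped), the per-call `Await.result` is where an Akka-style ask timeout would surface, which matches the suspicion in the report.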