Andrew Or created SPARK-3015:
--------------------------------

             Summary: Removing broadcast in quick successions causes Akka timeout
                 Key: SPARK-3015
                 URL: https://issues.apache.org/jira/browse/SPARK-3015
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.0.2
         Environment: Standalone EC2 Spark shell
            Reporter: Andrew Or
            Priority: Blocker
             Fix For: 1.1.0


This issue was originally reported in SPARK-2916 in the context of MLlib, but
we were able to reproduce it with a simple Spark shell command:

{code}
(1 to 10000).foreach { i => sc.parallelize(1 to 1000, 48).sum }
{code}

We do not yet fully understand the issue, but we have gleaned the following so
far. When the driver runs a GC, it attempts to clean up, all at once, every
broadcast block that has gone out of scope. This causes the driver to send many
blocking RemoveBroadcast messages to the executors, which in turn send blocking
UpdateBlockInfo messages back to the driver. Both of these calls block until
they receive the expected responses. We suspect that the high frequency at
which these blocking messages are sent causes either dropped messages or an
internal deadlock somewhere.
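
To make the suspected failure mode concrete, here is a toy sketch of how a
burst of nested blocking round trips can time out. It is not Spark or Akka
code, and it is not a confirmed diagnosis: the object name, the method names,
the fixed-size "driver" thread pool, and the latches are all assumptions made
for illustration. Each simulated removal holds a driver thread while it waits
for a reply, and that reply cannot arrive until the driver has a free thread to
acknowledge the simulated UpdateBlockInfo.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeoutException}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Toy model only: none of these names are Spark or Akka APIs.
object BlockingCleanupSketch {

  /** Simulates a burst of `removals` blocking round trips on a driver that has
    * `removals + spareThreads` handler threads. Returns how many timed out. */
  def timedOutRemovals(removals: Int, spareThreads: Int, timeout: FiniteDuration): Int = {
    val driverPool = Executors.newFixedThreadPool(removals + spareThreads)
    val executorPool = Executors.newCachedThreadPool()
    val driver = ExecutionContext.fromExecutorService(driverPool)
    val executor = ExecutionContext.fromExecutorService(executorPool)

    // Latches keep the toy deterministic: every removal holds its driver thread
    // from the start of the burst until all removals have an outcome.
    val allStarted = new CountDownLatch(removals)
    val allFinished = new CountDownLatch(removals)

    // Driver side: acknowledging a block update needs a free driver thread.
    def updateBlockInfo(id: Int): Future[Boolean] = Future(true)(driver)

    // Executor side: report back to the driver and block until it acknowledges.
    def removeBroadcast(id: Int): Future[Boolean] =
      Future(Await.result(updateBlockInfo(id), timeout))(executor)

    // Driver side: each removal blocks a driver thread waiting for the reply.
    val outcomes = (1 to removals).map { id =>
      Future {
        allStarted.countDown()
        allStarted.await()
        val timedOut =
          try { Await.result(removeBroadcast(id), timeout); false }
          catch { case _: TimeoutException => true }
        allFinished.countDown()
        allFinished.await()
        timedOut
      }(driver)
    }

    try outcomes.map(f => Await.result(f, timeout * 4)).count(identity)
    finally { driverPool.shutdownNow(); executorPool.shutdownNow() }
  }

  def main(args: Array[String]): Unit = {
    // No spare driver thread: every acknowledgement is stuck behind a blocked removal.
    println(timedOutRemovals(removals = 4, spareThreads = 0, timeout = 500.millis))
    // One spare driver thread is enough to drain the acknowledgements.
    println(timedOutRemovals(removals = 4, spareThreads = 1, timeout = 500.millis))
  }
}
```

In this model the first call reports 4 timeouts and the second reports 0. The
only difference is whether the burst of blocking removals leaves a driver
thread free to answer the executors. Whether anything similar happens inside
the real driver is exactly what remains to be confirmed.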

Unfortunately, the issue is difficult to reproduce, and whether it appears
depends on the environment. We have been able to reproduce it on a 6-node
cluster in us-west-2, but not in us-west-1, for instance.



