[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

Aaron Davidson (JIRA) Sun, 20 Jul 2014 16:49:24 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068083#comment-14068083
 ]


Aaron Davidson commented on SPARK-2282:
---------------------------------------

Hey Ken,

I created [PR 1503|https://github.com/apache/spark/pull/1503] to implement the 
solution I mentioned. It would be great if you could try testing out this patch 
on your cluster.

While testing, I noticed that the number of ephemeral ports in use still grows 
with the number of tasks because of how we launch new tasks on the PySpark 
daemon. However, this should only affect workers, and the rate of buildup on 
each worker should be divided by the number of workers. In other words, it 
should only ever be a problem on a very small cluster.
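
To make the scaling argument concrete, here is a rough back-of-the-envelope 
sketch. It is only an estimate under stated assumptions: one short-lived socket 
per task, tasks spread evenly across workers, and a 60 second TIME_WAIT period 
(a common Linux value, assumed here). The function name and parameters are 
hypothetical and are not part of Spark.

```python
def time_wait_sockets_per_worker(tasks, burst_seconds, workers=1,
                                 time_wait_seconds=60.0):
    """Estimate sockets still in TIME_WAIT on one worker when a burst ends.

    Assumptions (not taken from Spark): each task leaves exactly one socket
    in TIME_WAIT, tasks complete at a steady rate, and they are spread
    evenly across workers.
    """
    if burst_seconds <= 0 or workers < 1:
        raise ValueError("burst_seconds must be > 0 and workers must be >= 1")
    tasks_per_worker = tasks / workers
    # Only sockets closed within the last time_wait_seconds are still held.
    fraction_still_held = min(1.0, time_wait_seconds / burst_seconds)
    return tasks_per_worker * fraction_still_held


# The burst from the issue description: 17k tasks in 15 seconds.
single_worker = time_wait_sockets_per_worker(17000, 15, workers=1)
ten_workers = time_wait_sockets_per_worker(17000, 15, workers=10)
```

Under these assumptions the per-worker count falls in direct proportion to the 
number of workers, which is why the remaining buildup should matter only on a 
very small cluster.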

> PySpark crashes if too many tasks complete quickly
> --------------------------------------------------
>
>                 Key: SPARK-2282
>                 URL: https://issues.apache.org/jira/browse/SPARK-2282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.1, 1.0.0, 1.0.1
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>             Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be worked around outside of Spark by ensuring these properties 
> are set (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
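
As an illustration of the last option in the description above, the following 
is a minimal sketch of setting SO_REUSEADDR on a TCP socket before it is 
connected. Python is used only for illustration; this is not the code in Spark 
or in PR 1503, and the helper name make_reusable_socket is hypothetical.

```python
import socket


def make_reusable_socket():
    """Create a TCP socket with SO_REUSEADDR enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before the socket is bound or connected.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock
```

A caller would then connect as usual, for example 
sock.connect((host, port)) with host and port supplied by the caller, and 
close the socket when done.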



--
This message was sent by Atlassian JIRA
(v6.2#6252)
