[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063211#comment-14063211 ]

Aaron Davidson commented on SPARK-2282:
---------------------------------------

This should actually only be necessary on the master. With the SO_REUSEADDR 
socket option (or, roughly equivalently, the tcp_tw_reuse sysctl), the number of 
in-use sockets will climb to the maximum number of ephemeral ports but should 
then hold steady. It is still possible that another process trying to allocate 
an ephemeral port during that window will fail.

While tcp_tw_reuse is generally considered "safe", setting *tcp_tw_recycle* is 
the more thorough fix, though it can (very rarely) allow stray packets from 
already-closed streams to be accepted. With recycle enabled, connections should 
be reclaimed immediately after the TCP teardown, so no buildup of sockets 
should occur.
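
For concreteness, here is a minimal sketch of flipping these kernel settings 
from Python; it assumes a Linux host and root privileges, and touches the same 
/proc/sys entries as the echo commands quoted in the issue description below:

{code:python}
# Sketch: enable reuse/recycling of TIME_WAIT sockets on Linux (run as root).
# Equivalent to: echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse (and tcp_tw_recycle).

def set_sysctl(name, value):
    with open("/proc/sys/net/ipv4/" + name, "w") as f:
        f.write(str(value))

set_sysctl("tcp_tw_reuse", 1)    # generally considered safe
set_sysctl("tcp_tw_recycle", 1)  # more aggressive; see the caveat above
{code}

Note that values written to /proc/sys do not survive a reboot; to make them 
permanent they also need to go into /etc/sysctl.conf.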

Please let me know if setting either of these parameters helps on the driver 
machine. You can also verify that this problem is occurring by running 
{{netstat -lpn}} (iirc) during execution, which should show an inordinate 
number of open connections owned by the Spark driver process and by one of the 
Python daemon processes.
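
As an illustration (not part of the original ticket), one way to watch the 
buildup is to count sockets in TIME_WAIT by parsing /proc/net/tcp directly; 
the state column holds 06 for TIME_WAIT:

{code:python}
# Sketch: count TCP sockets currently in TIME_WAIT by parsing /proc/net/tcp.
# The 4th whitespace-separated field is the connection state in hex; 06 == TIME_WAIT.

def count_time_wait():
    count = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    if len(fields) > 3 and fields[3] == "06":
                        count += 1
        except IOError:
            pass  # e.g. IPv6 not available
    return count

print("TIME_WAIT sockets:", count_time_wait())
{code}

Running this (or something like {{netstat -tn | grep TIME_WAIT | wc -l}}) 
repeatedly while the job is executing should show the count climbing toward 
the ephemeral port limit if this is indeed the problem.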

> PySpark crashes if too many tasks complete quickly
> --------------------------------------------------
>
>                 Key: SPARK-2282
>                 URL: https://issues.apache.org/jira/browse/SPARK-2282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.1, 1.0.0, 1.0.1
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>             Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
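
For illustration only (the real change would be to the JVM-side socket creation 
in Spark, which is not reproduced here), setting SO_REUSEADDR on a client 
socket looks roughly like this at the Python socket level; the endpoint is a 
placeholder, not the actual accumulator server address:

{code:python}
import socket

# Sketch: create the socket and set SO_REUSEADDR before connecting, so the
# kernel may reuse a local address/port that is still sitting in TIME_WAIT.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# s.connect((accumulator_host, accumulator_port))  # placeholder endpoint
s.close()
{code}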


