[ 
https://issues.apache.org/jira/browse/SPARK-12411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12411:
------------------------------
    Fix Version/s: 1.6.1

> Reconsider executor heartbeats rpc timeout
> ------------------------------------------
>
>                 Key: SPARK-12411
>                 URL: https://issues.apache.org/jira/browse/SPARK-12411
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Nong Li
>            Assignee: Nong Li
>             Fix For: 1.6.1, 2.0.0
>
>
> Currently, the timeout for deciding that an executor has failed is the same as 
> the sender's RPC timeout ("spark.network.timeout"), which defaults to 120s. 
> This means that if there is a network issue, the executor gets only one attempt 
> to heartbeat, which probably makes failure detection flaky. 
> The executor has a config that controls how often it heartbeats 
> ("spark.executor.heartbeatInterval"), which defaults to 10s. This combination of 
> configs does not seem to make sense. The heartbeat RPC timeout should probably 
> be less than or equal to the heartbeatInterval.
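
The arithmetic behind the description above can be sketched as follows. This is only a simplified model, not Spark's actual heartbeat code: the function name heartbeat_attempts and its parameters are hypothetical, and it assumes heartbeats are sent one at a time, with each attempt blocking for the full RPC timeout during a network outage.

```python
import math


def heartbeat_attempts(detection_timeout_s, heartbeat_interval_s, rpc_timeout_s):
    """Count heartbeat attempts started before the failure-detection timeout expires.

    Simplified model (an assumption, not Spark's actual scheduling): heartbeats
    are sent one at a time, and during a network outage each attempt blocks for
    the full RPC timeout before the next one can start.
    """
    # The next attempt starts after whichever is longer: the configured
    # interval or the time the previous attempt spent waiting to time out.
    period_s = max(heartbeat_interval_s, rpc_timeout_s)
    # Attempts start at t = 0, period, 2 * period, ... while t < detection timeout.
    return math.ceil(detection_timeout_s / period_s)


# Defaults described in the issue: 120s RPC timeout, 10s heartbeat interval.
print(heartbeat_attempts(120, 10, 120))  # 1
# Proposed direction: RPC timeout no larger than the heartbeat interval.
print(heartbeat_attempts(120, 10, 10))   # 12
```

Under this model, the defaults leave room for a single attempt inside the 120s window, while capping the RPC timeout at the 10s heartbeat interval allows up to twelve.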



