[ https://issues.apache.org/jira/browse/SPARK-12411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or updated SPARK-12411:
------------------------------
    Fix Version/s: 1.6.1

> Reconsider executor heartbeats rpc timeout
> ------------------------------------------
>
>                 Key: SPARK-12411
>                 URL: https://issues.apache.org/jira/browse/SPARK-12411
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Nong Li
>            Assignee: Nong Li
>             Fix For: 1.6.1, 2.0.0
>
>
> Currently, the timeout for deciding that an executor has failed is the same as
> the sender's timeout ("spark.network.timeout"), which defaults to 120s.
> This means that if there is a network issue, the executor gets only one attempt
> to heartbeat, which probably makes failure detection flaky.
> The executor has a config that controls how often it heartbeats
> (spark.executor.heartbeatInterval), which defaults to 10s. This combination of
> configs doesn't seem to make sense. The heartbeat RPC timeout should probably
> be less than or equal to the heartbeatInterval.
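
To make the arithmetic in the description concrete, here is a minimal sketch of the worst case, in which every heartbeat RPC hangs until it times out. It is a simplified model, not Spark's actual heartbeat code: the function name heartbeat_attempts and its parameters are hypothetical, and it assumes the next attempt cannot start until the previous one has timed out.

```python
import math


def heartbeat_attempts(detection_timeout_s: float,
                       rpc_timeout_s: float,
                       interval_s: float) -> int:
    """Worst-case number of heartbeat attempts started before the driver
    gives up on the executor, assuming every attempt blocks until its
    RPC timeout expires (hypothetical simplified model)."""
    # A new attempt starts no sooner than the heartbeat interval, and no
    # sooner than the previous (blocking) attempt has timed out.
    period_s = max(interval_s, rpc_timeout_s)
    return math.ceil(detection_timeout_s / period_s)


# Defaults from the description: both timeouts follow spark.network.timeout.
same_timeout = heartbeat_attempts(120, 120, 10)

# Proposed direction: RPC timeout no larger than the heartbeat interval.
short_timeout = heartbeat_attempts(120, 10, 10)

print(same_timeout, short_timeout)
```

Under these assumptions, the executor gets a single attempt when the RPC timeout equals the 120s detection timeout, and twelve attempts when the RPC timeout is capped at the 10s heartbeat interval.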