[ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17449:
------------------------------
    Labels:   (was: bug)

> Relation between heartbeatInterval and network timeout
> ------------------------------------------------------
>
>                 Key: SPARK-17449
>                 URL: https://issues.apache.org/jira/browse/SPARK-17449
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Yang Liang
>            Priority: Minor
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s \
>     --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s \
>     --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 ms
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s \
>     --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by
>  * several internal components to convey liveness or execution information for
>  * in-progress tasks. It will also expire the hosts that have not heartbeated
>  * for more than spark.network.timeout.
>  */
> private val executorTimeoutMs =
>     sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") * 1000
> The relationship between spark.network.timeout and
> spark.executor.heartbeatInterval should at least be mentioned in the
> documentation; otherwise the errors above are confusing. Should Spark also
> validate these settings when it reads them?
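
One way to read the suggestion above is as a pre-flight check on the user
side. The sketch below is a hypothetical shell helper, not part of Spark: the
names to_ms and check_heartbeat are made up for illustration, it handles only
the ms, s, m and h suffixes, and it assumes the rule that
spark.executor.heartbeatInterval has to be smaller than spark.network.timeout,
which is what the logs above suggest.

```shell
# Hypothetical pre-flight helper (not a Spark API).

# Convert a duration such as "1500ms", "200s", "2m" or "1h" to milliseconds.
# Only these four suffixes are handled in this sketch.
to_ms() {
  case "$1" in
    *ms) _num=${1%ms}; _factor=1 ;;        # must be tested before the "s" case
    *s)  _num=${1%s};  _factor=1000 ;;
    *m)  _num=${1%m};  _factor=60000 ;;
    *h)  _num=${1%h};  _factor=3600000 ;;
    *)   echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  # Reject empty or non-numeric values such as "s" or "abcs".
  case "$_num" in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  echo $(( _num * _factor ))
}

# Return 0 only if the heartbeat interval ($1) is strictly shorter than the
# network timeout ($2); return 1 if it is not, 2 if a value cannot be parsed.
check_heartbeat() {
  _interval_ms=$(to_ms "$1") || return 2
  _timeout_ms=$(to_ms "$2") || return 2
  if [ "$_interval_ms" -ge "$_timeout_ms" ]; then
    echo "heartbeatInterval ($1) must be smaller than network.timeout ($2)" >&2
    return 1
  fi
}
```

With these functions defined, "check_heartbeat 200s 10s" fails with a message
before any executor is launched, while "check_heartbeat 20s 120s" succeeds, so
it could guard the spark-shell commands shown above.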



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
