[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-17449:
------------------------------
       Priority: Minor  (was: Major)
    Component/s: Documentation  (was: Spark Core)
     Issue Type: Improvement  (was: Bug)

I see, spark.executor.heartbeatInterval is certainly intended to be much smaller than spark.network.timeout. You could note that in the docs.

> Relation between heartbeatInterval and network timeout
> ------------------------------------------------------
>
>                 Key: SPARK-17449
>                 URL: https://issues.apache.org/jira/browse/SPARK-17449
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Yang Liang
>            Priority: Minor
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 ms
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms
>
> Source code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
>
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by several internal
>  * components to convey liveness or execution information for in-progress tasks. It will also
>  * expire the hosts that have not heartbeated for more than spark.network.timeout.
>  */
>
> private val executorTimeoutMs =
>   sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") * 1000
>
> The relation between spark.network.timeout and spark.executor.heartbeatInterval
> should at least be mentioned in the documentation; otherwise the errors above
> are confusing. Perhaps the two settings could also be checked against each
> other when they are read?
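
The check suggested at the end of the description could be approximated outside Spark with a small launcher-side guard. The following is only a sketch under stated assumptions: the helper names `to_ms` and `check_heartbeat` are hypothetical and not part of Spark, and the parser handles only values with an explicit `ms`, `s`, `m` or `h` suffix, which is narrower than what Spark itself accepts.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; these functions are NOT part of Spark.

# Convert a time string such as "200s", "10s" or "120000ms" to milliseconds.
# Assumption: only the suffixes ms, s, m (minutes) and h are supported here.
to_ms() {
  local value="$1"
  local re='^([0-9]+)(ms|s|m|h)$'
  if [[ "$value" =~ $re ]]; then
    local n="${BASH_REMATCH[1]}" unit="${BASH_REMATCH[2]}"
    case "$unit" in
      ms) echo $(( 10#$n )) ;;
      s)  echo $(( 10#$n * 1000 )) ;;
      m)  echo $(( 10#$n * 60000 )) ;;
      h)  echo $(( 10#$n * 3600000 )) ;;
    esac
  else
    echo "unsupported time value: $value" >&2
    return 1
  fi
}

# Return 0 only if the heartbeat interval is strictly smaller than the
# network timeout; return 1 if it is not, and 2 if a value cannot be parsed.
check_heartbeat() {
  local heartbeat_ms timeout_ms
  heartbeat_ms=$(to_ms "$1") || return 2
  timeout_ms=$(to_ms "$2") || return 2
  if (( heartbeat_ms >= timeout_ms )); then
    echo "spark.executor.heartbeatInterval ($1) should be smaller than spark.network.timeout ($2)" >&2
    return 1
  fi
}
```

With the values from the second and third commands in the description, `check_heartbeat 200s 10s` fails and prints a warning, so a wrapper script could refuse to start spark-shell. Passing this check is not a guarantee against heartbeat timeouts: the first command in the description uses 20s against a 120000 ms timeout and still lost its executor.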