[ https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791960#comment-16791960 ]
Shixiong Zhu commented on SPARK-25449:
--------------------------------------

I think this patch actually fixed a bug introduced by https://github.com/apache/spark/commit/0514e8d4b69615ba8918649e7e3c46b5713b6540, which didn't use the correct default timeout. Before this patch, setting `spark.executor.heartbeatInterval 30` would send a heartbeat every 30 ms, but each heartbeat RPC message had a timeout of 30 seconds. This patch just unifies the default time unit across all usages of "spark.executor.heartbeatInterval".

> Don't send zero accumulators in heartbeats
> ------------------------------------------
>
>                 Key: SPARK-25449
>                 URL: https://issues.apache.org/jira/browse/SPARK-25449
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Mukul Murthy
>            Assignee: Mukul Murthy
>            Priority: Major
>              Labels: release-notes
>             Fix For: 3.0.0
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics
> and are generally on the order of a few KBs. However, for large jobs with
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks
> to die with heartbeat failures. We can mitigate this by not sending zero
> metrics to the driver.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
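To illustrate the comment's point, here is a minimal sketch (in Python, not Spark's actual Scala code; `parse_time_as_ms` and its unit table are hypothetical) of how a unit-less config value like "30" diverges when two call sites assume different default time units, which is the inconsistency the patch removed:

```python
def parse_time_as_ms(value: str, default_unit: str) -> int:
    """Parse a duration string into milliseconds.

    If the value carries no unit suffix, fall back to default_unit,
    loosely mimicking how a config reader applies a default time unit.
    """
    units = {"ms": 1, "s": 1000, "m": 60_000}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    # No explicit unit: the chosen default decides the meaning.
    return int(value) * units[default_unit]

# Before the fix: the heartbeat scheduler read the interval with a
# millisecond default, while the RPC timeout path assumed seconds.
interval_ms = parse_time_as_ms("30", default_unit="ms")  # 30 ms between beats
timeout_ms = parse_time_as_ms("30", default_unit="s")    # 30 000 ms RPC timeout

# After unifying on one default unit, both call sites agree.
unified_interval = parse_time_as_ms("30", default_unit="s")
unified_timeout = parse_time_as_ms("30", default_unit="s")
```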
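The mitigation the issue describes, dropping zero-valued accumulator updates before sending a heartbeat, can be sketched as follows (Python pseudocode; the function name and payload shape are illustrative, not Spark's actual API):

```python
def filter_heartbeat_accumulators(task_updates):
    """Keep only non-zero accumulator values for each task.

    task_updates maps task_id -> {accumulator name -> value}; zero values
    carry no information for the driver, so stripping them shrinks the
    heartbeat message for large jobs with many tasks.
    """
    return {
        task_id: {name: v for name, v in accums.items() if v != 0}
        for task_id, accums in task_updates.items()
    }

updates = {
    1: {"bytesRead": 1024, "recordsWritten": 0, "shuffleSpill": 0},
    2: {"bytesRead": 0, "recordsWritten": 7, "shuffleSpill": 0},
}
trimmed = filter_heartbeat_accumulators(updates)
```

Only the non-zero metrics survive, so a heartbeat dominated by zero accumulators shrinks roughly in proportion to how many metrics are untouched.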