[ https://issues.apache.org/jira/browse/SPARK-26329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-26329: ---------------------------- Labels: release-notes (was: ) > ExecutorMetrics should poll faster than heartbeats > -------------------------------------------------- > > Key: SPARK-26329 > URL: https://issues.apache.org/jira/browse/SPARK-26329 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI > Affects Versions: 3.0.0 > Reporter: Imran Rashid > Assignee: Wing Yew Poon > Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > We should allow faster polling of the executor memory metrics (SPARK-23429 / > SPARK-23206) without requiring a faster heartbeat rate. We've seen the > memory usage of executors pike over 1 GB in less than a second, but > heartbeats are only every 10 seconds (by default). Spark needs to enable > fast polling to capture these peaks, without causing too much strain on the > system. > In the current implementation, the metrics are polled along with the > heartbeat, but this leads to a slow rate of polling metrics by default. If > users were to increase the rate of the heartbeat, they risk overloading the > driver on a large cluster, with too many messages and too much work to > aggregate the metrics. But, the executor could poll the metrics more > frequently, and still only send the *max* since the last heartbeat for each > metric. This keeps the load on the driver the same, and only introduces a > small overhead on the executor to grab the metrics and keep the max. > The downside of this approach is that we still need to wait for the next > heartbeat for the driver to be aware of the new peak. If the executor dies > or is killed before then, then we won't find out. A potential future > enhancement would be to send an update *anytime* there is an increase by some > percentage, but we'll leave that out for now. > Another possibility would be to change the metrics themselves to track peaks > for us, so we don't have to fine-tune the polling rate. For example, some > jvm metrics provide a usage threshold, and notification: > https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold > But, that is not available on all metrics. This proposal gives us a generic > way to get a more accurate peak memory usage for *all* metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org