Hi all,

I've been doing a bunch of performance measurement of Spark and, as part of that, added metrics that record the average CPU utilization, the disk throughput and utilization for each block device, and the network throughput while each task is running. These metrics are collected by reading the /proc filesystem, so they work only on Linux. I'm happy to submit a pull request with the appropriate changes, but first wanted to see whether enough people think this would be useful. I know the metrics Spark already reports (and shows in the UI) are overwhelming to some folks, so I don't want to add more instrumentation if it's not widely useful.
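(To give a concrete sense of what "reading the /proc filesystem" means here, below is a rough sketch of computing average CPU utilization from two snapshots of /proc/stat. This is illustrative only and not the code on my branch; the names CpuCounters, readCpuCounters, and cpuUtilization are made up for the example, and the parsing is simplified.)

    import scala.io.Source

    // Aggregate CPU counters from the first line of /proc/stat (values are in
    // USER_HZ ticks, summed across all cores).
    case class CpuCounters(totalTicks: Long, idleTicks: Long)

    def readCpuCounters(): CpuCounters = {
      val source = Source.fromFile("/proc/stat")
      try {
        // First line looks like: "cpu  user nice system idle iowait irq softirq steal ..."
        val fields = source.getLines().next().split("\\s+").drop(1).map(_.toLong)
        // Treat idle + iowait as "idle"; everything else as busy.
        CpuCounters(totalTicks = fields.sum, idleTicks = fields(3) + fields(4))
      } finally {
        source.close()
      }
    }

    // Average utilization over an interval is the fraction of non-idle ticks
    // between the snapshot taken at task start and the one at task end.
    def cpuUtilization(start: CpuCounters, end: CpuCounters): Double = {
      val total = end.totalTicks - start.totalTicks
      if (total == 0) 0.0 else 1.0 - (end.idleTicks - start.idleTicks).toDouble / total
    }

The disk and network metrics follow the same pattern, reading /proc/diskstats and /proc/net/dev at task start and end and differencing the counters.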
These metrics are slightly harder to interpret for Spark than the similar metrics reported by Hadoop because, with Spark, multiple tasks run in the same JVM and therefore as part of the same process. This means that, for example, the CPU utilization metrics reflect the CPU use of all tasks running in the JVM, rather than only the CPU time used by one particular task. This is both a pro and a con -- it makes it harder to determine why utilization is high (it may be driven by a different task), but it also makes the metrics useful for diagnosing straggler problems. I just wanted to clarify this before asking folks to weigh in on whether the added metrics would be useful.

-Kay

(If you're curious, the instrumentation code is on a very messy branch here: https://github.com/kayousterhout/spark-1/tree/proc_logging_perf_minimal_temp/core/src/main/scala/org/apache/spark/performance_logging )