I think I found my answer at https://github.com/kayousterhout/trace-analysis:
"One thing to keep in mind is that Spark does not currently include instrumentation to measure the time spent reading input data from disk or writing job output to disk (the 'Output write wait' shown in the waterfall is time to write shuffle output to disk, which Spark does have instrumentation for); as a result, the time shown as 'Compute' may include time using the disk. We have a custom Hadoop branch that measures the time Hadoop spends transferring data to/from disk, and we are hopeful that similar timing metrics will someday be included in the Hadoop FileStatistics API. In the meantime, it is not currently possible to understand how much of a Spark task's time is spent reading from disk via HDFS."

That said, this might be worth posting as a footnote on the event timeline page to avoid confusion :)

Best regards,
Tom Hubregtsen

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Info-from-the-event-timeline-appears-to-contradict-dstat-info-tp23862p23865.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
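For anyone who wants to check this themselves: the shuffle write time the quote refers to does show up per task in Spark's event log, while there is no corresponding field for time spent reading input from disk/HDFS (that time is folded into compute). Below is a minimal sketch that pulls "Shuffle Write Time" (nanoseconds) out of event-log lines; the sample log line at the end is a made-up illustration, not real output from a job.

```python
import json

def shuffle_write_times(lines):
    """Yield (task_id, shuffle_write_time_ns) for each task-end event.

    Spark records "Shuffle Write Time" per task in the event log, but
    there is no analogous metric for input read time from disk/HDFS --
    that time is counted as part of compute.
    """
    for line in lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics") or {}
        shuffle = metrics.get("Shuffle Write Metrics") or {}
        task_id = (event.get("Task Info") or {}).get("Task ID")
        yield task_id, shuffle.get("Shuffle Write Time", 0)

# Hypothetical sample event-log line, for illustration only.
sample = json.dumps({
    "Event": "SparkListenerTaskEnd",
    "Task Info": {"Task ID": 7},
    "Task Metrics": {"Shuffle Write Metrics": {"Shuffle Write Time": 1234567}},
})

print(list(shuffle_write_times([sample])))  # [(7, 1234567)]
```

Event logging has to be enabled (spark.eventLog.enabled=true) for these records to exist at all; the exact JSON layout can vary between Spark versions, so treat the field names above as an assumption to verify against your own logs.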