[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yongjia Wang updated SPARK-10912:
---------------------------------
    Attachment: s3a_metrics.patch

Adding s3a is fairly straightforward. I suspect the reason it is not included is that s3a support (via hadoop-aws.jar) is not part of the default Hadoop distribution, due to licensing issues. I created a patch that enables s3a metrics on both the executors and the driver. Reporting shuffle statistics requires more thought, although all the numbers are already collected in TaskMetrics.scala (input, output, shuffle, local, remote, spill, records, bytes, etc.). I think it would make sense to report the aggregated metrics per executor across all tasks, so it is easy to get an overall sense of disk I/O and network traffic.

> Improve Spark metrics executor.filesystem
> -----------------------------------------
>
>                 Key: SPARK-10912
>                 URL: https://issues.apache.org/jira/browse/SPARK-10912
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 1.5.0
>            Reporter: Yongjia Wang
>            Priority: Minor
>         Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has two filesystem metrics:
> "hdfs" and "file". I started using S3 as the persistent storage with a Spark
> standalone cluster in EC2, and S3 read/write metrics do not appear anywhere.
> The "file" metric appears to cover only the driver reading local files. It
> would also be nice to report shuffle read/write metrics, since they can help
> with optimization.
> I think these two things (S3 and shuffle) are very useful and would cover all
> the missing information about Spark I/O, especially for an S3 setup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
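[Editor's sketch] For context, the change described above boils down to registering one more filesystem scheme in ExecutorSource. The Scala sketch below follows the gauge-per-(scheme, statistic) pattern that ExecutorSource uses, backed by Hadoop's FileSystem.Statistics; the helper names (`fileStats`, `registerFileSystemStat`) mirror that class, but this is an illustration of the idea, not the attached s3a_metrics.patch, and the exact shape may differ by Spark version:

```scala
import scala.collection.JavaConverters._
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem

val metricRegistry = new MetricRegistry()

// Look up the accumulated statistics for a filesystem scheme
// ("hdfs", "file", "s3a", ...), if any filesystem of that scheme was used.
def fileStats(scheme: String): Option[FileSystem.Statistics] =
  FileSystem.getAllStatistics.asScala.find(_.getScheme == scheme)

// Register a gauge that reads one statistic for one scheme, falling back
// to a default value while no filesystem of that scheme exists yet.
def registerFileSystemStat[T](scheme: String, name: String,
    f: FileSystem.Statistics => T, defaultValue: T): Unit = {
  metricRegistry.register(MetricRegistry.name("filesystem", scheme, name),
    new Gauge[T] {
      override def getValue: T = fileStats(scheme).map(f).getOrElse(defaultValue)
    })
}

// Adding "s3a" here is essentially the whole patch. It only reports values
// when hadoop-aws.jar is on the classpath so the s3a filesystem can be used.
for (scheme <- Seq("hdfs", "file", "s3a")) {
  registerFileSystemStat(scheme, "read_bytes", _.getBytesRead, 0L)
  registerFileSystemStat(scheme, "write_bytes", _.getBytesWritten, 0L)
  registerFileSystemStat(scheme, "read_ops", _.getReadOps, 0)
  registerFileSystemStat(scheme, "largeRead_ops", _.getLargeReadOps, 0)
  registerFileSystemStat(scheme, "write_ops", _.getWriteOps, 0)
}
```

Because the gauges are lazy (evaluated at report time), registering an unused scheme is harmless: the gauge simply returns the default until the first s3a filesystem is instantiated.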