[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572414#comment-15572414 ]

Steve Loughran commented on SPARK-9004:
---------------------------------------

HADOOP-13605 added a whole new set of counters for HDFS, S3 and, hopefully soon, Azure; there's an API call on the FS, {{getStorageStatistics()}}, to query these. One problem, though: this isn't shipping in Hadoop branch-2 yet, so you can't write code that uses it unless there's some introspection/plugin mechanism. All the stats are just {{name: String -> value: Long}}, so something that collects a {{Map[String, Long]}} would work.

> Add s3 bytes read/written metrics
> ---------------------------------
>
>                 Key: SPARK-9004
>                 URL: https://issues.apache.org/jira/browse/SPARK-9004
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Abhishek Modi
>            Priority: Minor
>
> s3 read/write metrics can be pretty useful in finding the total aggregate
> data processed

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
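Since the HADOOP-13605 statistics are just name-to-Long pairs, per-executor snapshots can be modelled as {{Map[String, Long]}} and merged on the driver by summing values key by key. A minimal sketch of that merge, with illustrative counter names only (the real names would come from the filesystem's StorageStatistics):

```scala
// Hypothetical sketch: merge per-executor stat snapshots (name -> Long)
// into one aggregate map by summing values for each counter name.
object StorageStatsMerge {
  def merge(perExecutor: Seq[Map[String, Long]]): Map[String, Long] =
    perExecutor.flatten                       // Seq of (name, value) pairs
      .groupBy(_._1)                          // group by counter name
      .map { case (name, pairs) => name -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Illustrative counter names; not taken from any real Hadoop release.
    val exec1 = Map("s3a.bytesRead" -> 100L, "s3a.bytesWritten" -> 10L)
    val exec2 = Map("s3a.bytesRead" -> 250L)
    val total = merge(Seq(exec1, exec2))
    println(total("s3a.bytesRead"))    // 350
    println(total("s3a.bytesWritten")) // 10
  }
}
```

Because the merge is a plain key-wise sum, it works unchanged however many executors report, and counters missing on some executors are simply treated as absent rather than zero-filled.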
[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432462#comment-15432462 ]

Steve Loughran commented on SPARK-9004:
---------------------------------------

If you know the filesystem, you can get summary stats from {{FileSystem.getStatistics()}}; they'd have to be collected across all the executors. These counters are per-JVM, not isolated into individual jobs.
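Because the counters are per-JVM rather than per-job, one way to attribute bytes to a single job is to snapshot them on each executor before and after the job runs and take the difference. A sketch of that idea, modelled with plain maps in place of the real {{FileSystem.getStatistics()}} values:

```scala
// Hypothetical sketch: isolate one job's contribution from per-JVM counters
// by diffing a "before" snapshot against an "after" snapshot.
object PerJobDelta {
  def diff(before: Map[String, Long], after: Map[String, Long]): Map[String, Long] =
    // Counters only ever grow within a JVM, so after - before is the job's share.
    after.map { case (name, v) => name -> (v - before.getOrElse(name, 0L)) }

  def main(args: Array[String]): Unit = {
    // Illustrative values standing in for FileSystem.getStatistics() reads.
    val before = Map("bytesRead" -> 1000L, "bytesWritten" -> 50L)
    val after  = Map("bytesRead" -> 4000L, "bytesWritten" -> 70L)
    val delta  = diff(before, after)
    println(delta("bytesRead"))    // 3000
    println(delta("bytesWritten")) // 20
  }
}
```

This only isolates the job cleanly if no other job shares the executor JVM in that window, which is the usual caveat with process-wide counters.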
[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623834#comment-14623834 ]

Abhishek Modi commented on SPARK-9004:
--------------------------------------

Hadoop separates HDFS bytes, local filesystem bytes and S3 bytes in its counters; Spark combines all of them in its metrics. Separating them could give a better idea of the IO distribution.

Here's how it works in MR:
1. The client creates a Job object (org.apache.hadoop.mapreduce.Job) and submits it to the RM, which then launches the AM etc.
2. After job submission, the client continuously monitors the job to see if it has finished.
3. Once the job is finished, the client gets the counters of the job via the getCounters() function.
4. It logs them on the client using the "Counters=" format.

I don't really know how to implement it. Can it be done by modifying NewHadoopRDD, because I guess that's where the Job object is being used?
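The four MR client steps above can be sketched as follows, with a stub job standing in for org.apache.hadoop.mapreduce.Job (all names here are illustrative, not the real Hadoop API):

```scala
// Hypothetical sketch of the MR client flow: submit, poll until complete,
// then fetch the counters and log them in "Counters=" form.
object MrClientFlow {
  // Stub in place of org.apache.hadoop.mapreduce.Job; completes after 3 polls.
  final class StubJob {
    private var polls = 0
    def submit(): Unit = ()                              // 1. submit to the RM
    def isComplete: Boolean = { polls += 1; polls >= 3 } // 2. monitored by the client
    def getCounters: Map[String, Long] =                 // 3. counters once finished
      Map("S3_BYTES_READ" -> 1234L)
  }

  def run(job: StubJob): Map[String, Long] = {
    job.submit()
    while (!job.isComplete) Thread.sleep(1)              // client monitoring loop
    job.getCounters
  }

  def main(args: Array[String]): Unit =
    println("Counters=" + run(new StubJob))              // 4. log on the client
}
```

In Spark the open question from the comment remains: where this loop's equivalent would live, e.g. whether NewHadoopRDD's use of the Job object is the right hook.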