[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623834#comment-14623834 ]

Abhishek Modi commented on SPARK-9004:
--------------------------------------

Hadoop keeps HDFS bytes, local filesystem bytes and S3 bytes in separate counters, 
whereas Spark combines all of them in its metrics. Separating them would give a 
better picture of the I/O distribution.
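
For reference, the per-scheme split already exists on the Hadoop side: every 
FileSystem keeps a Statistics object tagged with its scheme. A minimal sketch of 
reading it (nothing Spark-specific, just the Hadoop FileSystem API):

{code:scala}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

object PerSchemeFsStats {
  def main(args: Array[String]): Unit = {
    // One Statistics object per (scheme, FileSystem class); bytes are already
    // tracked separately for hdfs://, file://, s3n://, etc.
    FileSystem.getAllStatistics.asScala.foreach { stats =>
      println(s"${stats.getScheme}: read=${stats.getBytesRead} written=${stats.getBytesWritten}")
    }
  }
}
{code}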

Here's how it works in MR: 

1. The client creates a Job object (org.apache.hadoop.mapreduce.Job) and submits 
it to the ResourceManager, which then launches the ApplicationMaster, etc.
2. After submission, the client keeps polling the job until it is finished. 
3. Once the job is finished, the client fetches its counters via the 
getCounters() method. 
4. The client logs them in the "Counters=" format (a sketch of steps 2-4 follows 
this list).
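
A rough sketch of what steps 2-4 look like from the MR client side. The job name 
and the job setup are placeholders; the point is only the counter retrieval, and 
that the filesystem counter group is keyed by scheme:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{FileSystemCounter, Job}

object MrCountersExample {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "example-job")  // placeholder name
    // ... mapper/reducer/input/output configuration elided ...

    job.waitForCompletion(true)        // step 2: block until the job finishes
    val counters = job.getCounters()   // step 3: fetch the aggregated counters

    // step 4: the filesystem counter group keeps bytes per scheme,
    // which is exactly the split this issue asks for on the Spark side.
    val hdfsRead = counters.findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue
    val s3Read   = counters.findCounter("S3", FileSystemCounter.BYTES_READ).getValue
    println(s"Counters= HDFS_BYTES_READ:$hdfsRead S3_BYTES_READ:$s3Read")
  }
}
{code}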

I don't really know how to implement this. Could it be done by modifying 
NewHadoopRDD, since I guess that's where the Job object is being used?
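
One possible direction, purely as a sketch: snapshot the per-scheme FileSystem 
statistics before and after the read path (e.g. around the record iterator that 
NewHadoopRDD.compute returns) and report the per-scheme deltas. The helper names 
below are made up, and wiring the deltas into the task/input metrics is not shown:

{code:scala}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

object PerSchemeReadDeltas {
  // Snapshot bytes read per scheme from Hadoop's FileSystem statistics.
  private def bytesReadByScheme(): Map[String, Long] =
    FileSystem.getAllStatistics.asScala
      .groupBy(_.getScheme)
      .map { case (scheme, stats) => scheme -> stats.map(_.getBytesRead).sum }

  // Run a block of work (e.g. consuming a partition's record iterator) and
  // report how many bytes were read from each scheme while it ran.
  def withPerSchemeReadMetrics[T](body: => T): (T, Map[String, Long]) = {
    val before = bytesReadByScheme()
    val result = body
    val deltas = bytesReadByScheme().map { case (scheme, bytes) =>
      scheme -> (bytes - before.getOrElse(scheme, 0L))
    }
    (result, deltas)
  }
}
{code}

In a real change these deltas would have to be captured per task on the executors 
and merged into the task metrics rather than printed, but that plumbing is beyond 
this sketch.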


> Add s3 bytes read/written metrics
> ---------------------------------
>
>                 Key: SPARK-9004
>                 URL: https://issues.apache.org/jira/browse/SPARK-9004
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Abhishek Modi
>            Priority: Minor
>
> S3 read/write metrics can be pretty useful for finding the total aggregate 
> data processed.


