steveloughran commented on pull request #29179:
URL: https://github.com/apache/spark/pull/29179#issuecomment-663523126


   > Interesting. Is this specific to the S3A impl or is there a higher base 
class? I want to make it work with multiple file formats if possible.
   
   It's in hadoop-common, with an interface, IOStatisticsSource, which can be 
implemented by anything that feels like it. There's passthrough in the core 
Hadoop IO stream/compression classes, and the MR located status fetcher 
collates it. IOStatisticsSnapshot is a static snapshot which does aggregation 
and is serializable via Java object streams (your code could return it) and 
JSON (the S3A committer will report what it collects).
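   
   A rough sketch of that pattern, to make it concrete. Every name here 
(StatsSource, CountingStream, PassthroughStream, StatsSnapshot, StatsDemo, the 
"bytes_read" key) is a hypothetical stand-in written for this example; none of 
it is the Hadoop API, whose method names and signatures may well differ. It 
only shows the three pieces: a source interface, a wrapper that passes 
statistics through, and a serializable snapshot that aggregates.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a "statistics source"; NOT the Hadoop interface.
interface StatsSource {
    Map<String, Long> currentStats();
}

// Leaf stream that counts the bytes it hands out through read().
class CountingStream extends ByteArrayInputStream implements StatsSource {
    private long bytesRead;

    CountingStream(byte[] data) { super(data); }

    @Override
    public synchronized int read() {
        int c = super.read();
        if (c >= 0) { bytesRead++; }
        return c;
    }

    @Override
    public synchronized int read(byte[] b, int off, int len) {
        int n = super.read(b, off, len);
        if (n > 0) { bytesRead += n; }
        return n;
    }

    @Override
    public synchronized Map<String, Long> currentStats() {
        return Collections.singletonMap("bytes_read", Long.valueOf(bytesRead));
    }
}

// Wrapper that collects nothing itself and passes through the inner stream's stats.
class PassthroughStream extends FilterInputStream implements StatsSource {
    PassthroughStream(InputStream in) { super(in); }

    @Override
    public Map<String, Long> currentStats() {
        return (in instanceof StatsSource)
            ? ((StatsSource) in).currentStats()
            : Collections.<String, Long>emptyMap();
    }
}

// Static snapshot: sums counters from many sources, serializable via object streams.
class StatsSnapshot implements Serializable {
    private static final long serialVersionUID = 1L;
    private final TreeMap<String, Long> counters = new TreeMap<>();

    void aggregate(Map<String, Long> stats) {
        stats.forEach((key, value) -> counters.merge(key, value, Long::sum));
    }

    long get(String key) { return counters.getOrDefault(key, 0L); }
}

class StatsDemo {
    // Serialize and deserialize, as a task might when returning a snapshot to a driver.
    static StatsSnapshot roundTrip(StatsSnapshot snapshot) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(snapshot);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (StatsSnapshot) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Read from two wrapped streams, aggregate their stats, ship the snapshot.
    static long aggregatedBytesRead() {
        try (PassthroughStream a = new PassthroughStream(new CountingStream(new byte[10]));
             PassthroughStream b = new PassthroughStream(new CountingStream(new byte[4]))) {
            a.read(new byte[6], 0, 6);  // 6 of 10 bytes available
            b.read(new byte[8], 0, 8);  // only 4 bytes available
            StatsSnapshot snapshot = new StatsSnapshot();
            snapshot.aggregate(a.currentStats());
            snapshot.aggregate(b.currentStats());
            return roundTrip(snapshot).get("bytes_read");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

   The point of the passthrough is that a caller holding only the outermost 
stream can still ask for statistics without knowing which filesystem sits 
underneath.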
   
   Although I'm using the S3A codebase to drive that API and the support 
classes, we've been getting ABFS ready for it too; should only take a single 
patch to move it over to this as well.
   
   There's been an API to get counters in the S3A streams for a while, but it's 
private and unstable, so the people who wanted to get at it (Impala) couldn't 
safely do so. This gives everyone something public, with more things collected 
and aggregation thereof.
   
   > The idea here is to push it out to the workers (in part per-host rate 
limiting) but also matching the code we have in the SQL side so we have less 
maintenance cost.
   
   What's doing the throttling here?


