steveloughran commented on pull request #29179: URL: https://github.com/apache/spark/pull/29179#issuecomment-663523126
> Interesting. Is this specific to the S3A impl or is there a higher base class? I want to make it work with multiple file formats if possible.

It's in hadoop common, with an interface IOStatisticsSource which can be implemented by anything that feels like it. There's passthrough in the core hadoop io stream/compression classes, and the MR located status fetcher collates it. IOStatisticsSnapshot is a static snapshot which does aggregation and is serializable via java object streams (your code could return it) and JSON (the s3a committer will report what it collects).

Although I'm using the S3A codebase to drive that API and the support classes, we've been getting ABFS ready for it too; it should only take a single patch to move it over to this as well.

There's been an API to get counters in the S3A streams for a while, but it's private and unstable, so those people who wanted at it (impala) couldn't safely do so. This gives everyone something public, with more things collected and aggregation thereof.

> The idea here is to push it out to the workers (in part per-host rate limiting) but also matching the code we have in the SQL side so we have less maintenance cost.

What's doing the throttling here?
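To make the shape of this concrete, here is a minimal, self-contained sketch of the pattern: a source interface, a wrapper stream that passes statistics through, and a serializable snapshot that aggregates. Every name in it (`StatsSource`, `StatsSnapshot`, `CountingStream`, `PassthroughStream`, the `bytes_read` key) is made up for illustration; it is not the hadoop API and only assumes Java 9+.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.TreeMap;

public class IOStatsSketch {

    /** Hypothetical stand-in for "anything which can supply statistics". */
    interface StatsSource {
        Map<String, Long> getStats();
    }

    /** Static snapshot: aggregates counters by summing; java-serializable. */
    static final class StatsSnapshot implements Serializable {
        private static final long serialVersionUID = 1L;
        private final TreeMap<String, Long> counters = new TreeMap<>();

        void aggregate(StatsSource source) {
            if (source == null) {
                return; // nothing to collect
            }
            source.getStats().forEach((k, v) -> counters.merge(k, v, Long::sum));
        }

        long counter(String name) {
            return counters.getOrDefault(name, 0L);
        }
    }

    /** Innermost stream: counts the bytes it hands out. */
    static final class CountingStream extends InputStream implements StatsSource {
        private final InputStream in;
        private long bytesRead;

        CountingStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) {
                bytesRead++;
            }
            return b;
        }

        @Override
        public Map<String, Long> getStats() {
            return Map.of("bytes_read", bytesRead);
        }
    }

    /** Wrapper: collects nothing itself, passes the inner stream's stats through. */
    static final class PassthroughStream extends FilterInputStream implements StatsSource {
        PassthroughStream(InputStream in) {
            super(in);
        }

        @Override
        public Map<String, Long> getStats() {
            return (in instanceof StatsSource) ? ((StatsSource) in).getStats() : Map.of();
        }
    }

    /** Serialize and deserialize, as if shipping a snapshot from worker to driver. */
    static StatsSnapshot roundTrip(StatsSnapshot s) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(s);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return (StatsSnapshot) in.readObject();
        }
    }

    /** Read a 5-byte and a 7-byte stream, aggregate, and return the total after a round trip. */
    static long demo() {
        try {
            StatsSnapshot snapshot = new StatsSnapshot();
            for (int size : new int[] {5, 7}) {
                PassthroughStream stream = new PassthroughStream(
                    new CountingStream(new ByteArrayInputStream(new byte[size])));
                stream.readAllBytes();
                snapshot.aggregate(stream);
            }
            return roundTrip(snapshot).counter("bytes_read");
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the wrapper is that callers only ever probe the outermost stream; whether the statistics come from the filesystem client or from a layer in between is hidden from them.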