OK, stream level. No, it's not the same.

For those s3a input stream stats, you don't need to go into the s3a internals:

1. every source of IOStats implements InputStreamStatistics, which is
   hadoop-common code
2. in close(), s3a input streams update the thread-level IOStatisticsContext
   (https://issues.apache.org/jira/browse/HADOOP-17461 ... it needed some
   stabilisation, so use it with Hadoop 3.4.0/Spark 4.0+)

The thread-level stuff is so that streams opened and closed in, say, the
parquet reader update stats just for that worker thread, even though you
never get near the stream instance itself.

Regarding the iceberg FileIO stats: well, maybe someone could add it to
those classes.

Spark 4+ could think about collecting the stats for each task and
aggregating them, as that was the original goal. You get that aggregation
indirectly on s3a with the s3a committers, and similarly through abfs, but
really Spark should just collect and report it itself.

On Thu, 12 Feb 2026 at 17:03, Romain Manni-Bucau <[email protected]> wrote:

> Hi Steve,
>
> Do you mean org.apache.iceberg.io.FileIOMetricsContext and
> org.apache.hadoop.fs.FileSystem.Statistics.StatisticsData? They miss most
> of what I'm looking for (429 responses, to cite a single one).
> software.amazon.awssdk.metrics helps a bit but is not sink friendly.
> Compared to hadoop-aws usage, combining the iceberg-native and aws s3
> client ones kind of compensates for the lack, but what I would love to see
> is org.apache.hadoop.fs.s3a.S3AInstrumentation and more particularly
> org.apache.hadoop.fs.s3a.S3AInstrumentation.InputStreamStatistics#InputStreamStatistics
> (I'm mainly reading in my use cases).
>
> Romain Manni-Bucau
> @rmannibucau <https://x.com/rmannibucau> | .NET Blog
> <https://dotnetbirdie.github.io/> | Blog <https://rmannibucau.github.io/> |
> Old Blog <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
> Javaccino founder (Java/.NET service - contact via linkedin)
>
> On Thu, 12 Feb 2026 at 15:50, Steve Loughran <[email protected]> wrote:
>
>> On Thu, 12 Feb 2026 at 10:39, Romain Manni-Bucau <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Is it intended that S3FileIO doesn't wire
>>> [aws sdk].ClientOverrideConfiguration.Builder#addMetricPublisher, so that,
>>> compared to hadoop-aws, you basically can't retrieve metrics from Spark
>>> (or any other engine) and send them to a collector in a centralized
>>> manner? Is there another intended way?
>>
>> There is already a PR up awaiting review by committers:
>> https://github.com/apache/iceberg/pull/15122
>>
>>> For plain hadoop-aws the workaround is to use (by reflection)
>>> S3AInstrumentation.getMetricsSystem().allSources() and wire it to a
>>> spark sink.
>>
>> The intended way to do it there is to use the IOStatistics API, which
>> not only gets you at the s3a stats: google cloud collects stuff the same
>> way, and there's explicit per-thread collection for stream reads and
>> writes.
>>
>> Try setting
>>
>>   fs.iostatistics.logging.level info
>>
>> to see what gets measured.
>>
>>> To be clear, I do care about the bytes written/read, but more
>>> importantly about the latency, number of requests, statuses, etc. The
>>> stats exposed through FileSystem in iceberg are < 10, whereas we should
>>> get >> 100 stats (taking hadoop as a reference).
>>
>> AWS metrics are a very limited set:
>> software.amazon.awssdk.core.metrics.CoreMetric
>>
>> The retry count is good here, as it measures stuff beneath any
>> application code. With the REST signer, it'd make sense to also collect
>> signing time, as the RPC call to the signing endpoint would be included.
>>
>> -steve
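[Editor's note] The per-thread collection described at the top of the thread can be sketched roughly as below. This is only a sketch, not code from the thread: it assumes Hadoop 3.4.0+ (hadoop-common and, for s3a paths, hadoop-aws) on the classpath, and the path argument is a placeholder. The classes used (IOStatisticsContext, IOStatisticsLogging) come from hadoop-common's org.apache.hadoop.fs.statistics package.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.statistics.IOStatistics;
import org.apache.hadoop.fs.statistics.IOStatisticsContext;
import org.apache.hadoop.fs.statistics.IOStatisticsLogging;

public class ThreadIOStatsSketch {
  public static void main(String[] args) throws Exception {
    // Per-thread statistics context: aggregates IO done on this thread.
    IOStatisticsContext ctx =
        IOStatisticsContext.getCurrentIOStatisticsContext();
    ctx.reset(); // start from a clean slate for this thread

    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g. s3a://bucket/table/part-0.parquet
    FileSystem fs = path.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(path)) {
      in.read(new byte[8192]);
    } // close() merges the stream's statistics into the thread context

    // Everything read on this thread, even by code (e.g. a parquet reader)
    // which never exposed the stream instance itself.
    IOStatistics stats = ctx.getIOStatistics();
    System.out.println(IOStatisticsLogging.ioStatisticsToPrettyString(stats));
  }
}
```

On a worker, the same ctx.getIOStatistics() call at task end is what an engine-side aggregation (the Spark 4+ idea above) would snapshot and ship to a sink.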
