On Thu, 12 Feb 2026 at 20:52, Romain Manni-Bucau <[email protected]> wrote:
> Commented inline
>
> Romain Manni-Bucau
> @rmannibucau <https://x.com/rmannibucau> | .NET Blog <https://dotnetbirdie.github.io/> |
> Blog <https://rmannibucau.github.io/> | Old Blog <http://rmannibucau.wordpress.com> |
> Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> |
> Book <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
> Javaccino founder (Java/.NET service - contact via linkedin)
>
> On Thu, 12 Feb 2026 at 21:13, Steve Loughran <[email protected]> wrote:
>
>> you get all thread-local stats for a specific thread from
>> IOStatisticsContext.getCurrentIOStatisticsContext().getIOStatistics()
>
> How is it supposed to work? My understanding is that it is basically a
> thread-local-like impl based on a map, the important point being that it
> works in the same bound thread, whereas the data is pulled from the sink
> in a scheduled executor thread, so I would still need to do my registry /
> sync it with the Spark metrics system, no?
>
>> take a snapshot of that and you have something JSON-marshallable or
>> Java-serializable which aggregates nicely.
>>
>> Call IOStatisticsContext.getCurrentIOStatisticsContext().reset() when
>> your worker thread starts a specific task to ensure you only get the
>> stats for that task (s3a & I think gcs).
>
> Do you mean implementing my own S3A or file IO? This is the
> instrumentation I tried to avoid, since I think it should be built in,
> not in apps.

More that Spark worker threads need to reset the stats once they pick up
their next piece of work, collect the changes, then push up the stats on
task commit; job commit aggregates these. The s3a committers do all this
behind the scenes (first into the intermediate manifest, then into the
final _SUCCESS file). Now that Spark builds with a version with the API,
someone could consider doing it there and lining up with the Spark history
server.
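A rough sketch of that per-task pattern, assuming Hadoop 3.3.5+ on the
classpath (org.apache.hadoop.fs.statistics). The IOStatisticsContext and
IOStatisticsSnapshot classes are the real API; the TaskStatsCollector
wrapper and its aggregation strategy are hypothetical, not Spark's actual
code:

```java
import org.apache.hadoop.fs.statistics.IOStatisticsContext;
import org.apache.hadoop.fs.statistics.IOStatisticsSnapshot;

// Hypothetical task runner; only the IOStatisticsContext calls are
// the real Hadoop API.
public class TaskStatsCollector {

    // Job-level rollup. IOStatisticsSnapshot is java.io.Serializable and
    // JSON-marshallable, and aggregates via aggregate().
    private final IOStatisticsSnapshot jobStats = new IOStatisticsSnapshot();

    public void runTask(Runnable taskBody) {
        IOStatisticsContext ctx =
            IOStatisticsContext.getCurrentIOStatisticsContext();
        ctx.reset();             // discard stats left over from previous work
        taskBody.run();          // the task's I/O, in this bound thread
        IOStatisticsSnapshot taskStats = ctx.snapshot();
        // "push up the stats on task commit": here we just aggregate locally
        synchronized (jobStats) {
            jobStats.aggregate(taskStats);
        }
    }
}
```

The key point is that reset() and snapshot() bracket a single unit of work
in the same thread, which is what makes the stats attributable to one task.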
Then whatever fs client, input stream or any other instrumented component
would just add its numbers.

>> from the fs you getIOStatistics() and you get all the stats of all
>> filesystems and streams after close(). Which, from a quick look at some
>> s3 IO to a non-AWS store, shows a couple of failures, interestingly
>> enough. We collect separate averages for success and failure on every op
>> so you can see the difference.
>>
>> The JMX stats we collect are a very small subset of the statistics;
>> stuff like "bytes drained in close" and time to wait for an executor in
>> the thread pool (action_executor_acquired) are important, as they're
>> generally a sign of misconfiguration.
>
> Yep, my high-level focus is to see whether the tuning or the tables must
> be adjusted, so 429s, volume and latencies are key there.

If you turn on AWS S3 server logging you will get the numbers of 503
throttle events and the paths; 429 is other stores. Bear in mind that the
recipients of the throttle events may not be the only callers triggering
them...things like bulk delete (hello, compaction) can throttle other work
going on against the same shard.

> Another thing I don't get is why not reuse hadoop-aws in Spark? It would
> at least enable mixing datasources more nicely and focus the work in a
> single location (it is already done).

Well, in Cloudera we do. Nothing to stop you. I also have a PoC of an s3
signer for Hadoop 3.4.3+ which gets its credentials from the REST server;
it simply wraps the existing one but picks up its binding info from the
filesystem Configuration.

-Steve
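For reference, pulling the filesystem-level stats Steve mentions might look
like the following sketch, again assuming Hadoop 3.3.5+; the bucket URI is
a placeholder, and the instanceof check is needed because not every
FileSystem implementation exposes IOStatistics:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.statistics.IOStatistics;
import org.apache.hadoop.fs.statistics.IOStatisticsLogging;
import org.apache.hadoop.fs.statistics.IOStatisticsSource;

public class FsStatsDump {
    public static void main(String[] args) throws Exception {
        // Placeholder bucket; s3a:// is where the rich stats come from.
        FileSystem fs = FileSystem.get(
            new Path("s3a://example-bucket/").toUri(), new Configuration());
        // ... run some I/O, close streams so their stats merge back ...
        if (fs instanceof IOStatisticsSource) {
            IOStatistics stats = ((IOStatisticsSource) fs).getIOStatistics();
            // Prints counters, gauges and min/mean/max stats, including the
            // separate success/failure averages per operation.
            System.out.println(
                IOStatisticsLogging.ioStatisticsToPrettyString(stats));
        }
        fs.close();
    }
}
```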
