Samrat Deb created FLINK-39544:
----------------------------------

             Summary: Improve observability in flink-s3-fs-native by exposing operation-level S3 metrics
                 Key: FLINK-39544
                 URL: https://issues.apache.org/jira/browse/FLINK-39544
             Project: Flink
          Issue Type: New Feature
          Components: Connectors / FileSystem
    Affects Versions: 2.3.0
            Reporter: Samrat Deb
             Fix For: 2.4.0


flink-s3-fs-native currently exposes only coarse IO counters. Operators cannot
see per-operation latency, S3 throttling (HTTP 503 SlowDown), retry counts and
reasons, multipart-upload lifecycle, stream reopens, or connection-pool
saturation through Flink's metric system. When checkpoint duration regresses in
production, there is no Flink-side signal for attributing the regression to S3,
the network, or the state backend.

Diagnosing such incidents today requires correlating Flink logs with AWS
CloudTrail or capturing packets, neither of which scales as a routine
operational practice.

This ticket proposes bridging AWS SDK v2's built-in MetricPublisher into Flink's
MetricGroup from inside flink-s3-fs-native, and adding a small set of
plugin-specific metrics that the SDK cannot see (NativeS3InputStream reopens,
RecoverableWriter / multipart-upload lifecycle).
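
As a rough illustration of the bridging idea, the sketch below folds
per-call samples into per-operation counters. It is not the proposed
implementation: ApiCallSample, CounterRegistry, Bridge, and the metric names
are hypothetical stand-ins that use only the JDK, so the example does not
depend on the AWS SDK or Flink. In the real plugin, the samples would arrive
through the SDK's MetricPublisher and the counters would be registered on a
Flink MetricGroup.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class S3MetricsBridgeSketch {

    /** Hypothetical stand-in for one completed S3 API call as reported by the SDK. */
    static final class ApiCallSample {
        final String operation;   // e.g. "PutObject"
        final Duration duration;  // wall-clock time of the whole call, retries included
        final int httpStatus;     // status of the final attempt (a simplification)
        final int retryCount;     // number of retries the SDK performed

        ApiCallSample(String operation, Duration duration, int httpStatus, int retryCount) {
            this.operation = operation;
            this.duration = duration;
            this.httpStatus = httpStatus;
            this.retryCount = retryCount;
        }
    }

    /** Hypothetical stand-in for a Flink metric group: named, thread-safe counters. */
    static final class CounterRegistry {
        private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

        void add(String name, long delta) {
            counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
        }

        long get(String name) {
            LongAdder adder = counters.get(name);
            return adder == null ? 0L : adder.sum();
        }
    }

    /** Translates each sample into per-operation counters (assumed metric names). */
    static final class Bridge {
        private final CounterRegistry registry;

        Bridge(CounterRegistry registry) {
            this.registry = registry;
        }

        void publish(ApiCallSample sample) {
            String prefix = "s3." + sample.operation + ".";
            registry.add(prefix + "calls", 1);
            registry.add(prefix + "latencyMillisTotal", sample.duration.toMillis());
            registry.add(prefix + "retries", sample.retryCount);
            // Assumption: count HTTP 503 as throttling, following the ticket's
            // "HTTP 503 SlowDown"; a real bridge might inspect the error code too.
            if (sample.httpStatus == 503) {
                registry.add(prefix + "throttled", 1);
            }
        }
    }

    public static void main(String[] args) {
        CounterRegistry registry = new CounterRegistry();
        Bridge bridge = new Bridge(registry);
        bridge.publish(new ApiCallSample("PutObject", Duration.ofMillis(120), 200, 0));
        bridge.publish(new ApiCallSample("PutObject", Duration.ofMillis(480), 503, 3));
        System.out.println("throttled PutObject calls: " + registry.get("s3.PutObject.throttled"));
    }
}
```

Keying counters by operation name is what would let an operator separate, for
example, throttled PutObject calls during a checkpoint from slow GetObject
calls during recovery. A latency histogram rather than a running total would
likely be more useful in practice; the total is used here only to keep the
sketch small.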

**Why this is targeted at flink-s3-fs-native specifically:**

flink-s3-fs-native owns its S3AsyncClient directly and can therefore attach an
AWS SDK v2 MetricPublisher at client construction. The same approach is not
available to flink-s3-fs-hadoop and flink-s3-fs-presto, because both delegate to
a Hadoop-owned filesystem (org.apache.hadoop.fs.s3a.S3AFileSystem and the Presto
equivalent), which constructs and owns the S3 client internally. Hadoop S3A
exposes its own IOStatistics framework, not AWS SDK v2 MetricPublisher.
Surfacing those statistics into Flink would require a separate adapter and a
Hadoop version floor, and would be coupled to S3A internals that change across
Hadoop releases. Doing this work in flink-s3-fs-native therefore has the
cleanest dependency footprint and the lowest classpath risk.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
