Samrat Deb created FLINK-39544:
----------------------------------
Summary: Improve observability in flink-s3-fs-native by exposing
operation-level S3 metrics
Key: FLINK-39544
URL: https://issues.apache.org/jira/browse/FLINK-39544
Project: Flink
Issue Type: New Feature
Components: Connectors / FileSystem
Affects Versions: 2.3.0
Reporter: Samrat Deb
Fix For: 2.4.0
flink-s3-fs-native currently exposes only coarse IO counters. Operators cannot see
per-operation latency, S3 throttling (HTTP 503 SlowDown), retry counts and reasons,
multipart-upload lifecycle, stream reopens, or connection-pool saturation through
Flink's metric system. When checkpoint duration regresses in production, there is
no Flink signal to attribute the cause to S3, the network, or the state backend.
Diagnosing such incidents today requires correlating Flink logs with AWS CloudTrail
or capturing packets, and neither approach scales as a routine operational practice.
This ticket proposes bridging AWS SDK v2's built-in MetricPublisher into Flink's
MetricGroup from inside flink-s3-fs-native, and adding a small set of plugin-specific
metrics that the SDK cannot see (NativeS3InputStream reopens, RecoverableWriter /
multipart-upload lifecycle).
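
A minimal sketch of the aggregation such a bridge might perform, using only the JDK so it runs standalone. Every name in it (S3OperationMetricsSketch, OperationStats, the publish parameters) is hypothetical and not part of Flink or the AWS SDK. A real implementation would presumably implement the SDK's MetricPublisher interface, read these values from the published metric collection, and register Flink counters or histograms on a MetricGroup instead of the plain LongAdders used here. Treating every HTTP 503 as throttling is also a simplifying assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the proposed bridge; not a Flink or AWS SDK class.
class S3OperationMetricsSketch {

    // Per-operation aggregates. In the plugin these would be Flink metrics
    // registered on a MetricGroup rather than plain LongAdders.
    static final class OperationStats {
        private final LongAdder calls = new LongAdder();
        private final LongAdder retries = new LongAdder();
        private final LongAdder throttled = new LongAdder();
        private final LongAdder totalLatencyMillis = new LongAdder();

        long calls() { return calls.sum(); }
        long retries() { return retries.sum(); }
        long throttled() { return throttled.sum(); }

        double averageLatencyMillis() {
            long n = calls.sum();
            return n == 0 ? 0.0 : (double) totalLatencyMillis.sum() / n;
        }
    }

    private final Map<String, OperationStats> statsByOperation = new ConcurrentHashMap<>();

    // Records one completed S3 API call. The parameters are assumed to be
    // extracted from whatever the SDK reports for that call.
    void publish(String operation, long latencyMillis, int retryCount, int httpStatus) {
        OperationStats stats =
                statsByOperation.computeIfAbsent(operation, key -> new OperationStats());
        stats.calls.increment();
        stats.retries.add(retryCount);
        stats.totalLatencyMillis.add(latencyMillis);
        if (httpStatus == 503) {
            // Simplification: count every 503 as S3 throttling (SlowDown).
            stats.throttled.increment();
        }
    }

    // Returns the aggregates for an operation, or empty stats if none were recorded.
    OperationStats stats(String operation) {
        return statsByOperation.getOrDefault(operation, new OperationStats());
    }

    public static void main(String[] args) {
        S3OperationMetricsSketch metrics = new S3OperationMetricsSketch();
        metrics.publish("PutObject", 120, 0, 200);
        metrics.publish("PutObject", 300, 2, 503);
        metrics.publish("GetObject", 40, 0, 200);

        OperationStats put = metrics.stats("PutObject");
        System.out.println("PutObject calls=" + put.calls()
                + " retries=" + put.retries()
                + " throttled=" + put.throttled()
                + " avgLatencyMs=" + put.averageLatencyMillis());
    }
}
```

Keying the aggregates by operation name is what would let an operator see, for example, that PutObject latency and throttling rose during a slow checkpoint while GetObject stayed flat.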
**Why this is targeted at flink-s3-fs-native specifically:**
flink-s3-fs-native owns its S3AsyncClient directly and can therefore attach an
AWS SDK v2 MetricPublisher at client construction. The same approach is not
available to flink-s3-fs-hadoop and flink-s3-fs-presto, because both delegate to
a Hadoop-owned filesystem (org.apache.hadoop.fs.s3a.S3AFileSystem and the Presto
equivalent), which constructs and owns the S3 client internally. Hadoop S3A
exposes its own IOStatistics framework, not AWS SDK v2 MetricPublisher; surfacing
those statistics into Flink would require a separate adapter and a Hadoop version
floor, and would be coupled to S3A internals that change across Hadoop releases. Doing
this work in flink-s3-fs-native therefore has the cleanest dependency footprint
and the lowest classpath risk.