Re: [DISCUSS] FLIP-576 Filesystem-Plugin Observability (flink-s3-fs-native)

Aleksandr Iushmanov Thu, 28 May 2026 01:07:15 -0700

Thank you Samrat,

Looks good to me!


Kind regards,
Alex


On Wed, 27 May 2026 at 17:25, Samrat Deb <[email protected]> wrote:

> Hi Aleksandr Iushmanov,
>
> > The proposal overall looks good to me, but I have a concern around the
> > number of metrics we enable by default. As you have mentioned in the doc,
> > the number of added time series is ~50. I have a feeling that enabling
> > them by default may lead to unpleasant surprises in terms of extra
> > cardinality and the volume of exported data unless it is guarded through
> > allowlists. My personal preference would be to keep this option opt-in.
>
> Thank you for the suggestion. Opt-in makes sense: it would let users
> decide the cardinality of metrics within their setup.
> Here is the change I plan to make to the FLIP:
>
>   s3.metrics.enabled: true
>   s3.metrics.allowlist:
>     - api_call_count
>     - api_call_duration_ms
>     - throttle_count
>     - retry_count
>     - iops
>     - mpu_aborted_total
>   s3.metrics.detailed.enabled: false
>
>
> Best,
> Samrat
>
>
>
> On Fri, May 22, 2026 at 5:26 PM Gabor Somogyi <[email protected]>
> wrote:
>
> > @Samrat
> > Thanks for the detailed explanation for the metrics usage.
> >
> > Throttling is not supported by the current implementation, even though
> > we plan to add metrics for it. It's good to go as is, though; I'm about
> > to add throttling support soon.
> >
> > ------------
> >
> > One small API refinement worth considering: instead of adding a second
> > "configure(Configuration, MetricGroup)"
> > overload to FileSystemFactory, introduce a separate opt-in interface:
> >
> > public interface MetricsAware {
> >     void setMetricGroup(MetricGroup metricGroup);
> > }
> >
> > Then inside FileSystem.initialize():
> > for (FileSystemFactory factory : factories) {
> >     if (factory instanceof MetricsAware) {
> >         ((MetricsAware) factory).setMetricGroup(metricGroup);
> >     }
> > }
> >
> > This keeps FileSystemFactory's contract unchanged: third-party
> > implementations need zero modifications unless they want metrics.
> > The FLIP's default-on collection is fine; this is purely an interface
> > hygiene suggestion.
> >
> > @Aleksandr
> > If opt-in means "s3.metrics.enabled" defaults to "false", I'd say that's
> > not the way to go.
> > Observability features that require pre-incident configuration tend to
> > never get enabled,
> > which directly defeats the FLIP's stated goal of closing the operational
> > blindness gap.
> >
> > The concern about cardinality is legitimate, but the math is favorable:
> > these ~50 series are at TM scope, not subtask scope. A 100-TM cluster
> > adds roughly 5,000 series (~50 x 100), which is modest compared to what
> > operator-level metrics already emit.
> >
> > The right answer is informed default-on with a clear escape hatch. The
> > FLIP already has the split between basic (default-on, bounded
> > cardinality) and detailed (opt-in via "s3.metrics.detailed.enabled").
> > Teams with strict cardinality budgets can also suppress the entire group
> > at the reporter level with a single line:
> > metrics.reporter.<name>.filter.excludes = *.filesystem.*:*:*
> >
> > During performance testing we intend to measure things in depth, and if
> > something blows up, fine-tuning is still a possibility during PR review.
> >
> > G
> >
> >
> > On Thu, May 21, 2026 at 6:12 PM Aleksandr Iushmanov <[email protected]>
> > wrote:
> >
> > > Hi Samrat,
> > >
> > > Thank you for putting it together. I believe that this is a good
> > > addition to ensure that Flink is operationally ready.
> > >
> > > The proposal overall looks good to me, but I have a concern around the
> > > number of metrics we enable by default. As you have mentioned in the
> > > doc, the number of added time series is ~50. I have a feeling that
> > > enabling them by default may lead to unpleasant surprises in terms of
> > > extra cardinality and the volume of exported data unless it is guarded
> > > through allowlists. My personal preference would be to keep this option
> > > opt-in.
> > >
> > > Please let me know your thoughts on this.
> > >
> > > Kind regards,
> > > Alex
> > >
> > >
> > > On Tue, 5 May 2026 at 10:58, Samrat Deb <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I'd like to open a discussion on FLIP-576: Filesystem-Plugin
> > > > Observability for (flink-s3-fs-native)[1].
> > > >
> > > > Apache Flink’s filesystem layer is critical to core operations like
> > > > checkpoints, savepoints, and state access, most of which rely heavily
> > > > on S3. Despite this, the current observability of the S3 integration
> > > > in Flink offers little insight into underlying issues. Engineers lack
> > > > visibility into key failure signals, including S3 throttling, retry
> > > > behaviour, slow operations, load distribution, multipart upload leaks,
> > > > and intermittent stream failures. As a result, diagnosing production
> > > > issues often requires manual correlation across logs and external
> > > > systems, making troubleshooting slow and unreliable. This
> > > > observability gap significantly impacts the operability of Flink in
> > > > real-world large-scale deployments.
> > > > This FLIP addresses that gap and builds observability support for the
> > > > native S3 FS.
> > > >
> > > > Looking forward to your feedback.
> > > >
> > > > Bests,
> > > > Samrat
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
> > > >
> > >
> >
>
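
The MetricsAware suggestion quoted above can be sketched as a small, self-contained Java example. This is only an illustration of the opt-in pattern, not Flink code: the MetricGroup and FileSystemFactory interfaces below are simplified stand-ins for the real Flink types, and PlainFactory, S3LikeFactory and MetricsWiring are hypothetical names introduced for this sketch.

```java
import java.util.List;

// Stand-in for Flink's MetricGroup; the real interface has many more methods.
interface MetricGroup {
    String name();
}

// Stand-in for Flink's FileSystemFactory; only the scheme is modeled here.
interface FileSystemFactory {
    String getScheme();
}

// The opt-in interface proposed in the thread.
interface MetricsAware {
    void setMetricGroup(MetricGroup metricGroup);
}

// Hypothetical factory that does not care about metrics: it needs no changes.
final class PlainFactory implements FileSystemFactory {
    @Override
    public String getScheme() {
        return "file";
    }
}

// Hypothetical factory that opts in by also implementing MetricsAware.
final class S3LikeFactory implements FileSystemFactory, MetricsAware {
    MetricGroup metricGroup; // stays null until a group is injected

    @Override
    public String getScheme() {
        return "s3";
    }

    @Override
    public void setMetricGroup(MetricGroup metricGroup) {
        this.metricGroup = metricGroup;
    }
}

final class MetricsWiring {
    // Hands the group to every factory that opted in and returns how many did.
    static int injectMetricGroup(
            List<? extends FileSystemFactory> factories, MetricGroup group) {
        int injected = 0;
        for (FileSystemFactory factory : factories) {
            if (factory instanceof MetricsAware) {
                ((MetricsAware) factory).setMetricGroup(group);
                injected++;
            }
        }
        return injected;
    }
}
```

With a list containing one PlainFactory and one S3LikeFactory, only the latter receives the group, which is the property the suggestion relies on: factories that ignore metrics keep compiling and behaving as before.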

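The allowlist proposal and the cardinality estimate from the thread can be put side by side in a short sketch. The option names mirror the ones quoted above, but the semantics are an assumption made for illustration (a metric is exported only when metrics are enabled and its name is on the allowlist); the thread does not spell them out. S3MetricsPlan is a hypothetical class, not part of Flink.

```java
import java.util.Set;

// Hypothetical model of the proposed s3.metrics.* options.
final class S3MetricsPlan {
    final boolean enabled;        // models s3.metrics.enabled
    final Set<String> allowlist;  // models s3.metrics.allowlist

    S3MetricsPlan(boolean enabled, Set<String> allowlist) {
        this.enabled = enabled;
        this.allowlist = Set.copyOf(allowlist);
    }

    // Assumed semantics: export only if enabled and the name is allowlisted.
    boolean isExported(String metricName) {
        return enabled && allowlist.contains(metricName);
    }

    // Series at TaskManager scope: per-TM series times the number of TMs.
    static long clusterSeries(long seriesPerTaskManager, long taskManagers) {
        return Math.multiplyExact(seriesPerTaskManager, taskManagers);
    }
}
```

Under these assumptions, S3MetricsPlan.clusterSeries(50, 100) reproduces the 5,000-series figure from the thread, and a shorter allowlist lowers the per-TM input to that product.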