Hi,

Thanks for driving this, Poorvank Bhatia.

As we discussed in Slack, FLIP-530 indeed can't be used as is to enable
FlameGraphs. But I think Rui raised a good point that the FlameGraph
implementation can be changed to take job configuration into account.

Also, I don't fully understand the concerns about enabling FlameGraphs by
default. IIUC, the option doesn't tell Flink to collect FGs
unconditionally; rather, it allows it to start collecting FGs once the tab
is opened in the UI. Or am I missing something?
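
To make my understanding concrete, here is a minimal sketch of the behaviour
I have in mind. The class and method names are made up for illustration and
this is not Flink's actual VertexThreadInfoTracker code; the `enabled` flag
only stands in for rest.flamegraph.enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of on-demand sampling; not Flink's implementation. */
class OnDemandSampler {
    // Stands in for rest.flamegraph.enabled (assumption for this sketch).
    private final boolean enabled;
    // Each entry is one snapshot of all thread stack traces in this JVM.
    private final List<Map<Thread, StackTraceElement[]>> samples = new ArrayList<>();

    OnDemandSampler(boolean enabled) {
        this.enabled = enabled;
    }

    /** Called only when someone asks for a flame graph, e.g. by opening the UI tab. */
    boolean requestFlameGraph(int numSamples) {
        if (!enabled) {
            return false; // feature off: the request is rejected, nothing is sampled
        }
        for (int i = 0; i < numSamples; i++) {
            samples.add(Thread.getAllStackTraces());
        }
        return true;
    }

    int sampleCount() {
        return samples.size();
    }

    public static void main(String[] args) {
        OnDemandSampler sampler = new OnDemandSampler(true);
        System.out.println(sampler.sampleCount()); // enabled, but no request yet
        sampler.requestFlameGraph(3);
        System.out.println(sampler.sampleCount()); // samples exist only after a request
    }
}
```

In this sketch, turning the flag on costs nothing by itself; samples are only
taken and held once a request arrives.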


Regards,
Roman


On Tue, Aug 19, 2025 at 2:43 PM Poorvank Bhatia <[email protected]>
wrote:

> Hello Gyula,
> Thanks for the suggestion! Enabling flamegraphs by default could indeed
> improve visibility, and in many stable environments, the passive overhead
> is minimal. However, based on our use cases, there are a few practical
> reasons we’ve opted to keep them disabled by default:
>
>    1. Memory & GC Behavior During Sampling: When the flamegraph tab is
>    opened in the UI, VertexThreadInfoTracker begins continuous stack trace
>    sampling every statsRefreshInterval (default 60s). Each sample consists
>    of ThreadInfoSample objects holding full StackTraceElement[] arrays,
>    which introduces non-trivial memory pressure, especially in
>    high-parallelism scenarios. The stack traces from all task managers are
>    stored on the JobManager heap within VertexThreadInfoTracker. We
>    observed that this structure grows rapidly with high parallelism
>    (>1000) and deep stack sampling, and it is retained until
>    cleanUpInterval (default 10 minutes) expires, causing memory issues in
>    the JM. The major issue is that a JM OOM affects the availability of
>    the entire cluster :(
>    2. Flink doesn't persist flamegraph data: Flamegraph samples are held
>    entirely in memory. For future iterations, we’re considering storing
>    them temporarily on local disk or in external storage (but that
>    requires significant changes), which would decouple the UI tab from
>    memory pressure.
>    3. Config values still come from RestOptions: Even if we enable
>    flamegraphs by default, the sampling parameters (e.g., numSamples,
>    stackDepth, delayBetweenSamples) are still initialized via RestOptions
>    on JobManager startup. Without dynamic REST reconfiguration (as
>    proposed), users would still need to restart the cluster to change
>    them.
>
> Let me know if that makes sense.
>
> On Tue, Aug 19, 2025 at 4:59 PM Gyula Fóra <[email protected]> wrote:
>
> > Hey!
> > Instead of adding new logic for this, can we make the flamegraphs enabled
> > by default?
> >
> > Based on my experience almost everyone wants it enabled, and it doesn't
> > seem to add any overhead when they are not actually checked in the UI.
> >
> > Cheers,
> > Gyula
> >
> > On Tue, Aug 19, 2025 at 1:27 PM Danny Cranmer <[email protected]>
> > wrote:
> >
> > > Hello Poorvank,
> > >
> > > Thanks for driving this, I can understand how dynamically enabling
> > > FlameGraphs can be powerful, so +1 on the general idea.
> > >
> > > 1. Instead of adding a FlameGraph-specific REST API, did you consider
> > > adding a more general config API? Similar to the dynamic job
> > > configuration [1] endpoint, but for cluster configs instead of job
> > > configs? We could add an allow list of supported config options and
> > > start with Flamegraph. This would allow other configs to use the API in
> > > the future without adding more APIs.
> > > 2. nit: As for the UI, I would prefer for the settings to take up less
> > > space. The new options are at the top of the view, even when not
> > > expanded.
> > >
> > > Thanks,
> > > Danny
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration
> > >
> > > On Tue, Aug 12, 2025 at 4:31 AM Poorvank Bhatia <[email protected]>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to open a discussion proposing the ability to enable
> > > > flamegraphs at runtime and make their configuration, i.e. number of
> > > > samples, delay between samples, and stack depth, *dynamically
> > > > adjustable via the Web UI*, without requiring any job or cluster
> > > > restarts.
> > > >
> > > > As of now, enabling flamegraphs requires setting
> > > > *rest.flamegraph.enabled=true* and restarting the Job. This is not
> > > > ideal for debugging live issues, especially in production
> > > > environments.
> > > >
> > > > I discussed this idea offline with Roman Khachatryan (author of
> > > > FLIP-530
> > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration>),
> > > > Rui Fan, and Arvid Heise. While Rui noted that this could potentially
> > > > align with FLIP-530’s direction, Roman confirmed that it’s better
> > > > handled as a separate effort, since FLIP-530 is scoped to job-level
> > > > config, whereas this proposal addresses cluster-level observability
> > > > via RestOptions.
> > > >
> > > > For design details, please refer to: Dynamic Flamegraph via UI
> > > > <https://docs.google.com/document/d/1A9fLFgXMGxQQn6X8WCv7mLL21AnLqrDFvLSHnUg8rLA/edit?tab=t.0#heading=h.s351fc464ma6>
> > > >
> > > > I’ve attached a short demo to help visualize the proposed feature and
> > > > gather feedback. Demo
> > > > <https://drive.google.com/file/d/1iik6aOc2uc9sFlHFlT8YDX5TKFdoD15u/view?usp=sharing>
> > > >
> > > > Looking forward to your thoughts.
> > > >
> > > > Regards,
> > > >
> > > > Poorvank Bhatia
> > > >
> > >
> >
>
