Hi, thanks for driving this, Poorvank Bhatia.
As we discussed in Slack, FLIP-530 indeed can't be used as is to enable
FlameGraphs. But I think Rui raised a good point: the FlameGraph
implementation could be changed to take job configuration into account.

Also, I don't fully understand the concerns about enabling FlameGraphs by
default. IIUC, the option doesn't tell Flink to collect FGs
unconditionally; rather, it allows Flink to start collecting FGs once the
tab is opened in the UI. Or am I missing something?

Regards,
Roman

On Tue, Aug 19, 2025 at 2:43 PM Poorvank Bhatia <[email protected]> wrote:

> Hello Gyula,
>
> Thanks for the suggestion! Enabling flamegraphs by default could indeed
> improve visibility, and in many stable environments, the passive overhead
> is minimal. However, based on our use cases, there are a few practical
> reasons we’ve opted to keep them disabled by default:
>
> 1. Memory & GC behavior during sampling: When the flamegraph tab is
>    opened in the UI, VertexThreadInfoTracker begins continuous stack
>    trace sampling every statsRefreshInterval (default 60s), with each
>    sample consisting of ThreadInfoSample objects that hold full
>    StackTraceElement[] arrays. This sampling introduces non-trivial
>    memory pressure, especially in high-parallelism scenarios. The data
>    for all stack traces across TaskManagers is then stored on the
>    JobManager heap within VertexThreadInfoTracker (each entry containing
>    ThreadInfoSample instances). We observed that this structure
>    accumulates rapidly with high parallelism (>1000) and deep stack
>    sampling, causing memory issues in the JM; the memory is retained
>    until cleanUpInterval (default 10 minutes) elapses. The major issue
>    is that a JM OOM affects the availability of the entire cluster :(
> 2. Flink doesn't persist flamegraph data: Flamegraph samples are held
>    entirely in memory. For future iterations, we’re considering storing
>    them temporarily on local disk or in external storage (but that
>    requires significant changes), which would decouple the UI tab from
>    memory pressure.
> 3. Config values still come from RestOptions: Even if we enable
>    flamegraphs by default, the sampling parameters (e.g., numSamples,
>    stackDepth, delayBetweenSamples) are still initialized via RestOptions
>    on JobManager startup. Without dynamic REST reconfiguration (as
>    proposed), users would still need to restart the cluster to change
>    them.
>
> Let me know if that makes sense.
>
> On Tue, Aug 19, 2025 at 4:59 PM Gyula Fóra <[email protected]> wrote:
>
> > Hey!
> >
> > Instead of adding new logic for this, can we make the flamegraphs
> > enabled by default?
> >
> > Based on my experience almost everyone wants it enabled, and it doesn't
> > seem to add any overhead when they are not actually checked in the UI.
> >
> > Cheers,
> > Gyula
> >
> > On Tue, Aug 19, 2025 at 1:27 PM Danny Cranmer <[email protected]>
> > wrote:
> >
> > > Hello Poorvank,
> > >
> > > Thanks for driving this, I can understand how dynamically enabling
> > > FlameGraphs can be powerful, so +1 on the general idea.
> > >
> > > 1. Instead of adding a FlameGraph-specific REST API, did you consider
> > >    adding a more general config API? Similar to the dynamic job
> > >    configuration [1] endpoint, but for cluster configs instead of job
> > >    configs? We could add an allow list of supported config options and
> > >    start with Flamegraph. This would allow other configs to use the
> > >    API in the future without adding more APIs.
> > > 2. nit: As for the UI, I would prefer for the settings to take up
> > >    less space. The new options are at the top of the view, even when
> > >    not expanded.
> > >
> > > Thanks,
> > > Danny
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration
> > >
> > > On Tue, Aug 12, 2025 at 4:31 AM Poorvank Bhatia <[email protected]>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to open a discussion proposing the ability to enable
> > > > flamegraphs at runtime and make their configuration, i.e., number
> > > > of samples, delay between samples, and stack depth, *dynamically
> > > > adjustable via the Web UI*, without requiring any job or cluster
> > > > restarts.
> > > >
> > > > As of now, enabling flamegraphs requires setting
> > > > *rest.flamegraph.enabled=true* and restarting the Job. This is not
> > > > ideal for debugging live issues, especially in production
> > > > environments.
> > > >
> > > > I discussed this idea offline with Roman Khachatryan (author of
> > > > FLIP-530
> > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration>),
> > > > Rui Fan, and Arvid Heise. While Rui noted that this could
> > > > potentially align with FLIP-530’s direction, Roman confirmed that
> > > > it’s better handled as a separate effort, since FLIP-530 is scoped
> > > > to job-level config, whereas this proposal addresses cluster-level
> > > > observability via RestOptions.
> > > >
> > > > For design details, please refer to: Dynamic Flamegraph via UI
> > > > <https://docs.google.com/document/d/1A9fLFgXMGxQQn6X8WCv7mLL21AnLqrDFvLSHnUg8rLA/edit?tab=t.0#heading=h.s351fc464ma6>
> > > >
> > > > I’ve attached a short demo to help visualize the proposed feature
> > > > and gather feedback. Demo
> > > > <https://drive.google.com/file/d/1iik6aOc2uc9sFlHFlT8YDX5TKFdoD15u/view?usp=sharing>
> > > >
> > > > Looking forward to your thoughts.
> > > >
> > > > Regards,
> > > >
> > > > Poorvank Bhatia
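
For a rough sense of scale on the JobManager memory concern raised in the thread (point 1 of Poorvank's reply), here is a back-of-envelope sketch. It is not taken from Flink's code: the class name, the formula, and the 48-bytes-per-frame figure are assumptions made for illustration. It simply multiplies parallelism by samples per subtask, frames per sample, and an assumed per-frame heap cost, and it assumes every sample carries a full-depth stack trace, so it is an upper bound under those assumptions rather than a measurement.

```java
// Hypothetical helper, not part of Flink: estimates an upper bound on the
// heap retained for one sampled vertex under the assumptions stated above.
final class FlameGraphHeapEstimate {

    private FlameGraphHeapEstimate() {}

    /**
     * parallelism    number of subtasks sampled for the vertex
     * numSamples     samples kept per subtask (assumed value)
     * stackDepth     maximum frames per sample (assumed value)
     * bytesPerFrame  assumed heap cost of one stack frame entry (a guess)
     */
    static long estimateBytes(long parallelism, long numSamples,
                              long stackDepth, long bytesPerFrame) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        long frames = Math.multiplyExact(
                Math.multiplyExact(parallelism, numSamples), stackDepth);
        return Math.multiplyExact(frames, bytesPerFrame);
    }

    public static void main(String[] args) {
        // Illustrative inputs only: parallelism 1000, 100 samples, depth 100, 48 B/frame.
        long bytes = estimateBytes(1_000, 100, 100, 48);
        System.out.println(bytes + " bytes, about " + bytes / (1024 * 1024) + " MiB");
    }
}
```

With these illustrative inputs the bound is 480,000,000 bytes (about 457 MiB) for a single vertex, which is consistent with the order of magnitude at which a JobManager heap could come under pressure. Real retention depends on actual stack depths and object layout, so treat the number as a sanity check only.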
