Are you sure the UI is the bottleneck? The UI gets back a JSON representation of this data structure: https://github.com/apache/flink/blob/2e21321f9c9d9aada7e4ad8ca90d915c34f58015/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexFlameGraph.java
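As a rough illustration of that structure, here is a minimal Python sketch (not Flink's actual implementation) that folds stack traces from several subtasks into one tree. The `Node` class, `merge_samples`, `count_nodes`, and the frame names are all hypothetical. It assumes every subtask produces the same set of stacks; under that assumption the tree the UI has to render has the same number of nodes whatever the parallelism, and only the hit counters grow.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One frame in the merged flame graph (hypothetical, not Flink's class)."""
    name: str
    hits: int = 0
    children: dict = field(default_factory=dict)


def merge_samples(samples_per_subtask):
    """Fold every subtask's stack traces into a single tree.

    samples_per_subtask: one entry per subtask, each a list of stack traces;
    a trace is a list of frame names ordered root-first.
    """
    root = Node("root")
    for subtask_samples in samples_per_subtask:
        for trace in subtask_samples:
            root.hits += 1
            node = root
            for frame in trace:
                # Reuse the child for this frame if it exists, else create it.
                if frame not in node.children:
                    node.children[frame] = Node(frame)
                node = node.children[frame]
                node.hits += 1
    return root


def count_nodes(node):
    """Number of nodes a renderer would have to draw."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Illustrative frame names only; every subtask runs the same operator code.
trace_a = ["Thread.run", "StreamTask.invoke", "MapOperator.processElement"]
trace_b = ["Thread.run", "StreamTask.invoke", "Output.collect"]

small = merge_samples([[trace_a, trace_b]] * 2)    # parallelism 2
large = merge_samples([[trace_a, trace_b]] * 200)  # parallelism 200

print(count_nodes(small), count_nodes(large))
print(small.hits, large.hits)
```

Both trees contain the same nodes; the two runs differ only in the per-node hit counts, which is the data the UI receives after the backend merge.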
All samples from the individual subtasks get merged in the backend; the UI just renders this one data structure. The complexity is *O(s)*, where *s* is the number of elements on the stack, not *O(s × n)*, where *n* is the number of subtasks. Since all subtasks execute the same code, *s* is expected to be stable regardless of the parallelism.

Best,
Alexander Fedulov

On Fri, Feb 11, 2022 at 11:01 AM David Morávek <d...@apache.org> wrote:

> There are already tools [1] that simplify this for the user.
>
> I honestly don't know, it feels like it can bring more problems than actual
> benefits as this heavily relies on the environment. It can easily break for
> some users, e.g. because of the kernel settings; their architecture might
> not be supported; also, we'd need to go an extra mile regarding the
> security.
>
> Considering there are already other tools that are specifically designed
> for this (such as [1]), I personally don't feel that this should be part of
> Flink.
>
> [1] https://github.com/yahoo/kubectl-flame
>
>
> On Fri, Feb 11, 2022 at 9:28 AM Jacky Lau <281293...@qq.com.invalid>
> wrote:
>
> > Our Flink application is on k8s. Yes, users can use the async-profiler
> > directly, but it is not convenient for them: they have to download the jars
> > and need to know how to use it. And some users don't know the tool. If we
> > integrate it, users will benefit a lot.
> >
> > On 2022/01/26 18:56:17 David Morávek wrote:
> > > I'd second Alex's concerns. Is there a reason why you can't use the
> > > async-profiler directly? In what kind of environment are your Flink
> > > clusters running (YARN / k8s / ...)?
> > >
> > > Best,
> > > D.
> > >
> > > On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov <al...@ververica.com>
> > > wrote:
> > >
> > >> Hi Jacky,
> > >>
> > >> Could you please clarify what kind of *problems* you experience with the
> > >> large parallelism?
> > >> You referred to D3, is it something related to rendering
> > >> on the browser side or is it about the samples collection process? Were you
> > >> able to identify the bottleneck?
> > >>
> > >> Fundamentally I have some concerns regarding the proposed approach:
> > >> 1. Calling shell scripts triggered via the web UI is a security concern and
> > >> it needs to be evaluated carefully if it could introduce any unexpected
> > >> attack vectors (depending on the implementation, passed parameters etc.)
> > >> 2. My understanding is that the async-profiler implementation is
> > >> system-dependent. How do you propose to handle multiple architectures?
> > >> Would you like to ship each available implementation within Flink? [1]
> > >> 3. Do you plan to make use of full async-profiler features including native
> > >> calls sampling with perf_events? If so, the issue I see is that some
> > >> environments restrict ptrace calls by default [2]
> > >>
> > >> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> > >> [2] https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
> > >>
> > >>
> > >> Best,
> > >> Alexander Fedulov
> > >>
> > >> On Wed, Jan 26, 2022 at 1:59 PM 李森 <li...@icloud.com.invalid> wrote:
> > >>
> > >>> This is an expected feature, as we also experienced browser crashes on
> > >>> existing operator-level flame graphs
> > >>>
> > >>> Best,
> > >>> Echo Lee
> > >>>
> > >>>> On Jan 24, 2022, at 6:16 PM, David Morávek <da...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Jacky,
> > >>>>
> > >>>> The link seems to be broken, here is the correct one [1].
> > >>>>
> > >>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >>>>
> > >>>> Best,
> > >>>> D.
> > >>>>
> > >>>>> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <28...@qq.com.invalid> wrote:
> > >>>>>
> > >>>>> Hi All,
> > >>>>> I would like to start the discussion on FLIP-213
> > >>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs>,
> > >>>>> which aims to provide a TaskManager-level (process-level) flame graph
> > >>>>> using async-profiler, which is the most popular tool for Java performance
> > >>>>> profiling; Arthas and IntelliJ both use it.
> > >>>>> We also support it at our company, Ant Group.
> > >>>>> Flink currently supports FLIP-165: Operator's Flame Graphs,
> > >>>>> which draws flame graphs with the front-end library d3-flame-graph
> > >>>>> and has some problems with jobs of large parallelism.
> > >>>>> Please be aware that the FLIP wiki page is not fully done,
> > >>>>> since I don't know whether it will be accepted by the Flink community.
> > >>>>> Feel free to add your thoughts to make this feature better! I
> > >>>>> am looking forward to all your responses. Thanks very much!
> > >>>>>
> > >>>>> Best Jacky Lau