Are you sure the UI is the bottleneck? The UI gets back a JSON
representation of this data structure:
https://github.com/apache/flink/blob/2e21321f9c9d9aada7e4ad8ca90d915c34f58015/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexFlameGraph.java


All samples from the individual subtasks get merged in the backend; the UI
just renders this one data structure. The complexity on the UI side is
*O(s)*, where *s* is the number of elements on the stack, not *O(s·n)*,
where *n* is the number of subtasks. Since all subtasks execute the same
code, *s* is expected to be stable regardless of the parallelism.
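
To illustrate the point, here is a minimal sketch of how merging stack samples
into a single tree keeps the result independent of the number of subtasks. It
is not the actual JobVertexFlameGraph code: the class name, the Node type, and
the frame names are hypothetical and chosen only for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not Flink's implementation.
public class FlameGraphMergeSketch {

    /** One frame in the merged tree; hits counts the samples that passed through it. */
    public static final class Node {
        public final String frame;
        public int hits;
        public final Map<String, Node> children = new LinkedHashMap<>();

        public Node(String frame) {
            this.frame = frame;
        }

        /** Number of nodes in this subtree, i.e. what a UI would have to render. */
        public int size() {
            int n = 1;
            for (Node child : children.values()) {
                n += child.size();
            }
            return n;
        }
    }

    /** Merges stack samples (outermost frame first) from any number of subtasks into one tree. */
    public static Node merge(List<List<String>> samples) {
        Node root = new Node("root");
        for (List<String> stack : samples) {
            root.hits++;
            Node current = root;
            for (String frame : stack) {
                // Identical frames at the same depth share a node; only the counter grows.
                current = current.children.computeIfAbsent(frame, Node::new);
                current.hits++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Hypothetical frames; every subtask runs the same code, so the stacks match.
        List<String> stack = List.of("Task.run", "StreamTask.invoke", "MyMapper.map");
        Node one = merge(List.of(stack));
        Node many = merge(List.of(stack, stack, stack, stack));
        System.out.println(one.size() + " nodes for 1 subtask, "
                + many.size() + " nodes for 4 subtasks");
    }
}
```

In this sketch both trees contain 4 nodes (the root plus three frames); only
the hit counters differ, so the amount of data the UI has to draw does not
grow with the parallelism.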

Best,
Alexander Fedulov

On Fri, Feb 11, 2022 at 11:01 AM David Morávek <d...@apache.org> wrote:

> There are already tools [1] that simplify this for the user.
>
> I honestly don't know, it feels like it can bring more problems than actual
> benefits, as this relies heavily on the environment. It can easily break for
> some users, e.g. because of the kernel settings, or because their
> architecture is not supported. We'd also need to go the extra mile
> regarding security.
>
> Considering there are already other tools that are specifically designed
> for this (such as [1]), I personally don't feel that this should be part of
> Flink.
>
> [1] https://github.com/yahoo/kubectl-flame
>
>
> On Fri, Feb 11, 2022 at 9:28 AM Jacky Lau <281293...@qq.com.invalid>
> wrote:
>
> > Our Flink application runs on k8s. Yes, users can use async-profiler
> > directly, but it is not convenient: they have to download the jars and
> > know how to use them, and some users don't know the tool at all. If we
> > integrate it, users will benefit a lot.
> >
> > On 2022/01/26 18:56:17 David Morávek wrote:
> > > I'd second Alex's concerns. Is there a reason why you can't use the
> > > async-profiler directly? In what kind of environment are your Flink
> > > clusters running (YARN / k8s / ...)?
> > >
> > > Best,
> > > D.
> > >
> > > On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov <al...@ververica.com>
> > > wrote:
> > >
> > >> Hi Jacky,
> > >>
> > > >> Could you please clarify what kind of *problems* you experience with the
> > > >> large parallelism? You referred to D3, is it something related to rendering
> > > >> on the browser side or is it about the samples collection process? Were you
> > > >> able to identify the bottleneck?
> > >>
> > > >> Fundamentally I have some concerns regarding the proposed approach:
> > > >> 1. Calling shell scripts triggered via the web UI is a security concern and
> > > >> it needs to be evaluated carefully if it could introduce any unexpected
> > > >> attack vectors (depending on the implementation, passed parameters etc.)
> > > >> 2. My understanding is that the async-profiler implementation is
> > > >> system-dependent. How do you propose to handle multiple architectures?
> > > >> Would you like to ship each available implementation within Flink? [1]
> > > >> 3. Do you plan to make use of full async-profiler features including native
> > > >> calls sampling with perf_events? If so, the issue I see is that some
> > > >> environments restrict ptrace calls by default [2]
> > >>
> > >> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> > >> [2]
> > >>
> > >>
> >
> https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
> > >>
> > >>
> > >> Best,
> > >> Alexander Fedulov
> > >>
> > >> On Wed, Jan 26, 2022 at 1:59 PM 李森 <li...@icloud.com.invalid> wrote:
> > >>
> > > >>> This is an expected feature, as we also experienced browser crashes on
> > > >>> existing operator-level flame graphs
> > >>>
> > >>> Best,
> > >>> Echo Lee
> > >>>
> > > >>>> On Jan 24, 2022, at 6:16 PM, David Morávek <da...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Jacky,
> > >>>>
> > >>>> The link seems to be broken, here is the correct one [1].
> > >>>>
> > >>>> [1]
> > >>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >>>>
> > >>>> Best,
> > >>>> D.
> > >>>>
> > > >>>>> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <28...@qq.com.invalid> wrote:
> > >>>>>
> > > >>>>> Hi All,
> > > >>>>>
> > > >>>>> I would like to start the discussion on FLIP-213
> > > >>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs>,
> > > >>>>> which aims to provide TaskManager-level (process-level) flame graphs
> > > >>>>> using async-profiler, which is the most popular tool for Java
> > > >>>>> performance analysis. Arthas and IntelliJ both use it, and we already
> > > >>>>> support it at Ant Group.
> > > >>>>>
> > > >>>>> Flink already supports FLIP-165: Operator's Flame Graphs. It draws
> > > >>>>> flame graphs with the front-end library d3-flame-graph, which has some
> > > >>>>> problems with jobs of large parallelism.
> > > >>>>>
> > > >>>>> Please be aware that the FLIP wiki page is not fully done, since I
> > > >>>>> don't know whether it will be accepted by the Flink community.
> > > >>>>>
> > > >>>>> Feel free to add your thoughts to make this feature better! I am
> > > >>>>> looking forward to all your responses. Thanks very much!
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > > >>>>> Best,
> > > >>>>> Jacky Lau
> > >>>
> > >>
> >
>
