Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-11-05 Thread Yun Tang
Hi Jacky,

Apart from the ongoing discussion, I think you could share your current progress,
or even publish it as one of the flink-packages [1].


[1] https://flink-packages.org/


Best
Yun Tang

From: Jacky Lau <281293...@qq.com.INVALID>
Sent: Friday, February 11, 2022 10:11
To: dev@flink.apache.org 
Subject: RE: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

Our Flink application runs on k8s. Yes, users can use the async-profiler directly,
but that is not convenient: they have to download the jars and need to know how
to use them, and some users don't know the tool at all. If we integrate it, users
will benefit a lot.


Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-02-11 Thread Austin Cawley-Edwards
Pyroscope[1] and Parca[2] are other options for less-intrusive profiling (&
great fits for k8s) that move the burden from Flink & its UI to tools that
are purpose-built for this use case. Perhaps we could investigate what it
would take (if anything) to make Flink compatible with those?

Best,
Austin

[1]: https://pyroscope.io/
[2]: https://www.parca.dev/


On Fri, Feb 11, 2022 at 8:33 AM Alexander Fedulov 
wrote:

> Are you sure the UI is the bottleneck? The UI gets back a JSON
> representation of this data structure:
>
> https://github.com/apache/flink/blob/2e21321f9c9d9aada7e4ad8ca90d915c34f58015/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexFlameGraph.java
>
>
> All samples from the individual subtasks get merged in the backend, the UI
> just renders this one data structure. The complexity is *O(s)*, where *s*
> is the number of elements on the stack, not *O(s*n)* where *n* is the
> number of subtasks. Since all subtasks execute the same code, *s* is
> expected to be stable regardless of the parallelism.
>
> Best,
> Alexander Fedulov
>

Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-02-11 Thread Alexander Fedulov
Are you sure the UI is the bottleneck? The UI gets back a JSON
representation of this data structure:
https://github.com/apache/flink/blob/2e21321f9c9d9aada7e4ad8ca90d915c34f58015/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexFlameGraph.java


All samples from the individual subtasks get merged in the backend, the UI
just renders this one data structure. The complexity is *O(s)*, where *s*
is the number of elements on the stack, not *O(s*n)* where *n* is the
number of subtasks. Since all subtasks execute the same code, *s* is
expected to be stable regardless of the parallelism.
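
To make the argument above concrete, here is a minimal sketch of how samples from many subtasks can be folded into one tree. It is only an illustration under assumptions, not Flink's actual backend code: `merge_samples`, `count_nodes`, the dict layout, and the frame names are all hypothetical.

```python
def new_node(name):
    # One flame graph frame: its name, sample count, and child frames by name.
    return {"name": name, "value": 0, "children": {}}


def merge_samples(subtask_samples):
    """Merge stack samples from all subtasks into a single tree.

    subtask_samples holds one entry per subtask; each entry is a list of
    stack traces, and each trace is a list of frame names (root to leaf).
    """
    root = new_node("root")
    for samples in subtask_samples:
        for trace in samples:
            root["value"] += 1
            node = root
            for frame in trace:
                # Identical frames from different subtasks share one node.
                if frame not in node["children"]:
                    node["children"][frame] = new_node(frame)
                node = node["children"][frame]
                node["value"] += 1
    return root


def count_nodes(node):
    # Size of the structure the UI would have to render.
    return 1 + sum(count_nodes(child) for child in node["children"].values())


# Hypothetical frame names; every subtask runs the same code.
trace_a = ["Thread.run", "StreamTask.invoke", "MapOperator.processElement"]
trace_b = ["Thread.run", "StreamTask.invoke", "Output.collect"]
one_subtask = [trace_a, trace_b]

small = merge_samples([one_subtask] * 2)
large = merge_samples([one_subtask] * 200)
print(count_nodes(small), count_nodes(large))  # prints: 5 5
```

In this sketch only the sample counts grow with the number of subtasks; the number of nodes handed to the renderer stays the same.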

Best,
Alexander Fedulov

On Fri, Feb 11, 2022 at 11:01 AM David Morávek  wrote:

> There are already tools [1] that simplify this for the user.
>
> I honestly don't know; it feels like it could bring more problems than actual
> benefits, as this heavily relies on the environment. It can easily break for
> some users, e.g. because of kernel settings, or because their architecture is
> not supported. We'd also need to go the extra mile regarding
> security.
>
> Considering there are already other tools that are specifically designed
> for this (such as [1]), I personally don't feel that this should be part of
> Flink.
>
> [1] https://github.com/yahoo/kubectl-flame
>
>

Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-02-11 Thread David Morávek
There are already tools [1] that simplify this for the user.

I honestly don't know; it feels like it could bring more problems than actual
benefits, as this heavily relies on the environment. It can easily break for
some users, e.g. because of kernel settings, or because their architecture is
not supported. We'd also need to go the extra mile regarding security.

Considering there are already other tools that are specifically designed
for this (such as [1]), I personally don't feel that this should be part of
Flink.

[1] https://github.com/yahoo/kubectl-flame


On Fri, Feb 11, 2022 at 9:28 AM Jacky Lau <281293...@qq.com.invalid> wrote:

> Our Flink application runs on k8s. Yes, users can use the async-profiler
> directly, but that is not convenient: they have to download the jars
> and need to know how to use them, and some users don't know the tool at all.
> If we integrate it, users will benefit a lot.
>


RE: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-02-11 Thread Jacky Lau
Our Flink application runs on k8s. Yes, users can use the async-profiler directly,
but that is not convenient: they have to download the jars and need to know how
to use them, and some users don't know the tool at all. If we integrate it, users
will benefit a lot.

On 2022/01/26 18:56:17 David Morávek wrote:
> I'd second Alex's concerns. Is there a reason why you can't use the
> async-profiler directly? In what kind of environment are your Flink
> clusters running (YARN / k8s / ...)?
> 
> Best,
> D.
> 


RE: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-02-10 Thread Jacky Lau
Hi Alexander,
   Sorry for the late response; it was the Chinese Spring Festival.
   The bottleneck is rendering on the browser side.
   For 1) we support a user-defined script capability, as YARN does, and the flame
graph script just wraps async-profiler. So we should make it secure.
   For 2) yes, we use a different async-profiler package for each
architecture.
   For 3) probably not.
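
As a rough illustration of point 2) only, the sketch below picks a bundled profiler package by platform. It is an assumption-laden example, not the code used at Ant Group or in the FLIP: the package names, `PROFILER_PACKAGES`, `ARCH_ALIASES`, and `select_profiler_package` are all hypothetical placeholders.

```python
import platform

# Hypothetical mapping from (OS, CPU architecture) to a bundled package name.
# The names are placeholders, not real async-profiler artifact names.
PROFILER_PACKAGES = {
    ("linux", "x86_64"): "profiler-linux-x64",
    ("linux", "aarch64"): "profiler-linux-arm64",
    ("darwin", "x86_64"): "profiler-macos",
    ("darwin", "aarch64"): "profiler-macos",
}

# Different platforms report the same architecture under different names.
ARCH_ALIASES = {"amd64": "x86_64", "arm64": "aarch64"}


def select_profiler_package(system=None, machine=None):
    """Return the package for the given (or current) platform, or fail loudly."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    machine = ARCH_ALIASES.get(machine, machine)
    try:
        return PROFILER_PACKAGES[(system, machine)]
    except KeyError:
        # An unsupported platform should surface as a clear error, not a crash
        # inside a native library.
        raise RuntimeError(
            f"no profiler package bundled for {system}/{machine}"
        ) from None


print(select_profiler_package("Linux", "AMD64"))  # prints: profiler-linux-x64
```

Any platform missing from the table raises an error, which is the "architecture might not be supported" case raised elsewhere in this thread.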

On 2022/01/26 15:24:51 Alexander Fedulov wrote:
> Hi Jacky,
> 
> Could you please clarify what kind of *problems* you experience with the
> large parallelism? You referred to D3, is it something related to rendering
> on the browser side or is it about the samples collection process? Were you
> able to identify the bottleneck?
> 
> Fundamentally I have some concerns regarding the proposed approach:
> 1. Calling shell scripts triggered via the web UI is a security concern and
> it needs to be evaluated carefully if it could introduce any unexpected
> attack vectors (depending on the implementation, passed parameters etc.)
> 2. My understanding is that the async-profiler implementation is
> system-dependent. How do you propose to handle multiple architectures?
> Would you like to ship each available implementation within Flink? [1]
> 3. Do you plan to make use of full async-profiler features including native
> calls sampling with perf_events? If so, the issue I see is that some
> environments restrict ptrace calls by default [2]
> 
> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> [2] https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
> 
> 
> Best,
> Alexander Fedulov
> 
> On Wed, Jan 26, 2022 at 1:59 PM 李森  wrote:
> 
> > This is a much-anticipated feature, as we have also experienced browser
> > crashes with the existing operator-level flame graphs.
> >
> > Best,
> > Echo Lee
> >
> > > On Jan 24, 2022, at 6:16 PM, David Morávek wrote:
> > >
> > > Hi Jacky,
> > >
> > > The link seems to be broken, here is the correct one [1].
> > >
> > > [1]
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >
> > > Best,
> > > D.
> > >
> > >> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <28...@qq.com.invalid> wrote:
> > >>
> > >> Hi All,
> > >>     I would like to start the discussion on FLIP-213
> > >> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs>,
> > >> which aims to provide TaskManager-level (process-level) flame graphs
> > >> using async-profiler, which is the most popular Java performance
> > >> profiling tool; Arthas and IntelliJ both use it. We already support
> > >> this at our company, Ant Group.
> > >>    Flink currently supports FLIP-165: Operator's Flame Graphs, which
> > >> draws flame graphs with the front-end library d3-flame-graph; this has
> > >> some problems for jobs with large parallelism.
> > >>    Please be aware that the FLIP wiki page is not fully done,
> > >> since I don't know whether it will be accepted by the Flink community.
> > >>    Feel free to add your thoughts to make this feature better! I
> > >> am looking forward to all your responses. Thanks very much!
> > >>
> > >> Best,
> > >> Jacky Lau
> >
>