Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs
Hi Jacky,

Apart from the ongoing discussion, I think you could share your current progress, or even publish it as one of the flink-packages [1].

[1] https://flink-packages.org/

Best,
Yun Tang

From: Jacky Lau <281293...@qq.com.INVALID>
Sent: Friday, February 11, 2022 10:11
To: dev@flink.apache.org
Subject: RE: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

Our Flink application runs on k8s. Yes, users can use async-profiler directly, but it is not convenient: they have to download the jars and need to know how to use the tool, and some users don't know the tool at all. If we integrate it, users will benefit a lot.

On 2022/01/26 18:56:17 David Morávek wrote:
> I'd second Alex's concerns. Is there a reason why you can't use
> async-profiler directly? In what kind of environment are your Flink
> clusters running (YARN / k8s / ...)?
>
> Best,
> D.
>
> On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov wrote:
>
>> Hi Jacky,
>>
>> Could you please clarify what kind of *problems* you experience with
>> large parallelism? You referred to D3; is it something related to
>> rendering on the browser side, or is it about the sample collection
>> process? Were you able to identify the bottleneck?
>>
>> Fundamentally, I have some concerns regarding the proposed approach:
>> 1. Calling shell scripts triggered via the web UI is a security concern,
>> and it needs to be evaluated carefully whether it could introduce any
>> unexpected attack vectors (depending on the implementation, passed
>> parameters, etc.)
>> 2. My understanding is that the async-profiler implementation is
>> system-dependent. How do you propose to handle multiple architectures?
>> Would you like to ship each available implementation within Flink? [1]
>> 3. Do you plan to make use of the full async-profiler feature set,
>> including native call sampling with perf_events? If so, the issue I see
>> is that some environments restrict ptrace calls by default [2]
>>
>> [1] https://github.com/jvm-profiling-tools/async-profiler#download
>> [2] https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
>>
>> Best,
>> Alexander Fedulov
>>
>> On Wed, Jan 26, 2022 at 1:59 PM 李森 wrote:
>>
>>> This is an expected feature, as we also experienced browser crashes
>>> with the existing operator-level flame graphs.
>>>
>>> Best,
>>> Echo Lee
>>>
>>>> On Jan 24, 2022, at 6:16 PM, David Morávek wrote:
>>>>
>>>> Hi Jacky,
>>>>
>>>> The link seems to be broken; here is the correct one [1].
>>>>
>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
>>>>
>>>> Best,
>>>> D.
>>>>
>>>>> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <28...@qq.com.invalid> wrote:
>>>>>
>>>>> Hi All,
>>>>> I would like to start the discussion on FLIP-213
>>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs>,
>>>>> which aims to provide TaskManager-level (process-level) flame graphs
>>>>> using async-profiler, which is the most popular tool for Java
>>>>> performance analysis; Arthas and IntelliJ both use it.
>>>>> We already support this at Ant Group.
>>>>> Flink currently supports FLIP-165: Operator's Flame Graphs, which
>>>>> draws flame graphs with the front-end library d3-flame-graph and has
>>>>> some problems with jobs of large parallelism.
>>>>> Please be aware that the FLIP wiki page is not fully done, since I
>>>>> don't know whether it will be accepted by the Flink community.
>>>>> Feel free to add your thoughts to make this feature better! I am
>>>>> looking forward to all your responses. Thanks very much!
>>>>>
>>>>> Best,
>>>>> Jacky Lau
Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs
Pyroscope [1] and Parca [2] are other options for less-intrusive profiling (and great fits for k8s) that move the burden from Flink and its UI to tools that are purpose-built for this use case. Perhaps we could investigate what it would take (if anything) to make Flink compatible with those?

Best,
Austin

[1]: https://pyroscope.io/
[2]: https://www.parca.dev/
Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs
Are you sure the UI is the bottleneck? The UI gets back a JSON representation of this data structure:

https://github.com/apache/flink/blob/2e21321f9c9d9aada7e4ad8ca90d915c34f58015/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexFlameGraph.java

All samples from the individual subtasks get merged in the backend; the UI just renders this one data structure. The complexity is *O(s)*, where *s* is the number of elements on the stack, not *O(s*n)*, where *n* is the number of subtasks. Since all subtasks execute the same code, *s* is expected to be stable regardless of the parallelism.

Best,
Alexander Fedulov
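To make the complexity argument above concrete, here is a minimal sketch of merging stack samples into a single tree. It is not the actual JobVertexFlameGraph code: the class name, the `Node` type, and the frame names are hypothetical and chosen only for illustration. It assumes each sample is a list of frame names ordered from the outermost frame inward.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Flink's implementation.
public class MergedFlameGraphSketch {

    /** One frame in the merged tree; children are keyed by frame name. */
    static final class Node {
        final String frame;
        int hits;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String frame) {
            this.frame = frame;
        }

        /** Number of nodes in this subtree, i.e. what a UI would have to render. */
        int size() {
            int n = 1;
            for (Node child : children.values()) {
                n += child.size();
            }
            return n;
        }
    }

    /** Merges samples from all subtasks into one tree; identical stacks share nodes. */
    static Node merge(List<List<String>> samples) {
        Node root = new Node("root");
        for (List<String> stack : samples) {
            root.hits++;
            Node current = root;
            for (String frame : stack) {
                // Reuse the existing child for this frame or create it once.
                current = current.children.computeIfAbsent(frame, Node::new);
                current.hits++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Made-up frame names standing in for a subtask's call stack.
        List<String> stack = List.of("Task.run", "StreamTask.invoke", "MyMap.map");

        Node oneSubtask = merge(List.of(stack));
        Node manySubtasks = merge(Collections.nCopies(1000, stack));

        // Both trees have the same number of nodes; only the hit counts differ.
        System.out.println(oneSubtask.size() + " " + manySubtasks.size());
        System.out.println(manySubtasks.children.get("Task.run").hits);
    }
}
```

With identical stacks, the merged tree for 1000 samples has as many nodes as the tree for one sample; the extra samples only raise the hit counters. If subtasks ran different code paths, the tree would grow with the number of distinct frames, not with the number of subtasks.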
Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs
There are already tools [1] that simplify this for the user.

I honestly don't know; it feels like this can bring more problems than actual benefits, as it heavily relies on the environment. It can easily break for some users, e.g. because of kernel settings, or because their architecture might not be supported. We'd also need to go the extra mile regarding security.

Considering there are already other tools that are specifically designed for this (such as [1]), I personally don't feel that this should be part of Flink.

[1] https://github.com/yahoo/kubectl-flame
RE: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs
Our Flink application runs on k8s. Yes, users can use async-profiler directly, but it is not convenient: they have to download the jars and need to know how to use the tool, and some users don't know the tool at all. If we integrate it, users will benefit a lot.
RE: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs
Hi Alexander,

Sorry for the late response; it was the Chinese Spring Festival.

The bottleneck is rendering on the browser side.

For 1): we support a user-defined script capability, like YARN does, and the flame graph script just encapsulates async-profiler. So we should make it secure.
For 2): yes, we use a different async-profiler package for each architecture.
For 3): probably not.

On 2022/01/26 15:24:51 Alexander Fedulov wrote:
> Hi Jacky,
>
> Could you please clarify what kind of *problems* you experience with
> large parallelism? You referred to D3; is it something related to
> rendering on the browser side, or is it about the sample collection
> process? Were you able to identify the bottleneck?
>
> Fundamentally, I have some concerns regarding the proposed approach:
> 1. Calling shell scripts triggered via the web UI is a security concern,
> and it needs to be evaluated carefully whether it could introduce any
> unexpected attack vectors (depending on the implementation, passed
> parameters, etc.)
> 2. My understanding is that the async-profiler implementation is
> system-dependent. How do you propose to handle multiple architectures?
> Would you like to ship each available implementation within Flink? [1]
> 3. Do you plan to make use of the full async-profiler feature set,
> including native call sampling with perf_events? If so, the issue I see
> is that some environments restrict ptrace calls by default [2]
>
> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> [2] https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
>
> Best,
> Alexander Fedulov
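On concern 1 (scripts triggered from the web UI and the parameters passed to them), one common way to narrow the attack surface is to validate every user-supplied value against an allowlist and to pass arguments as a list rather than as a shell string. The sketch below is only an illustration under assumptions: the class name, the duration limit, and the event allowlist are hypothetical, and nothing here is taken from the FLIP. The `-e`, `-d` and `-f` options are the event, duration and output-file options of async-profiler's `profiler.sh`.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of Flink or of the FLIP-213 proposal.
public class ProfilerCommandSketch {

    // Assumed allowlist of profiling events a UI user may request.
    private static final Set<String> ALLOWED_EVENTS = Set.of("cpu", "alloc", "lock", "wall");

    // Assumed upper bound so a request cannot profile indefinitely.
    private static final int MAX_DURATION_SECONDS = 300;

    /**
     * Builds the argument list for the profiler script. scriptPath and outputFile
     * are assumed to come from server-side configuration, never from the request.
     */
    static List<String> buildCommand(
            String scriptPath, long pid, int durationSeconds, String event, String outputFile) {
        if (pid <= 0) {
            throw new IllegalArgumentException("pid must be positive");
        }
        if (durationSeconds < 1 || durationSeconds > MAX_DURATION_SECONDS) {
            throw new IllegalArgumentException("duration out of range");
        }
        if (event == null || !ALLOWED_EVENTS.contains(event)) {
            throw new IllegalArgumentException("unsupported event");
        }
        // Each value is its own argument, so no shell ever parses user input.
        return List.of(
                scriptPath,
                "-e", event,
                "-d", Integer.toString(durationSeconds),
                "-f", outputFile,
                Long.toString(pid));
    }

    public static void main(String[] args) {
        // The script path and output file below are placeholders.
        List<String> command =
                buildCommand("/opt/async-profiler/profiler.sh", 1234L, 30, "cpu", "/tmp/flame.html");
        System.out.println(command);
    }
}
```

The resulting list could be handed to `java.lang.ProcessBuilder`, which starts the program directly without a shell. This addresses only parameter injection; the architecture and ptrace questions in points 2 and 3 are unaffected.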